Optimize storage of the blockchain data #87

Open
Robyer opened this issue Aug 3, 2022 · 2 comments
Comments

@Robyer

Robyer commented Aug 3, 2022

I think there are 2 main problems with the blockchain data storage right now:

  1. an enormous number of files
  2. the extremely small size of these files

A filesystem has a cluster size (the smallest unit of space that can be allocated on disk), and if a file is smaller than that, the remaining space in its cluster is wasted because it can't be used for anything else. For example, NTFS has a default cluster size of 4 KB. Most of the files in the Dero mainnet folder are only 101 bytes (I think that is a single transaction, or a single block with no transactions in it). That means every such file Dero saves wastes 3995 bytes of its 4 KB cluster, i.e. 97.5 % of the space allocated for these files.
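For illustration, here is a rough Go sketch (not part of Dero; the 4 KB cluster size is just an assumption, and real allocation depends on the filesystem) that walks a data directory and estimates how much space is lost to cluster slack:

```go
// Hypothetical diagnostic: estimate cluster-slack waste in a data directory.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

const clusterSize int64 = 4096 // assumed NTFS default cluster size

func main() {
	dir := os.Args[1] // e.g. the mainnet data folder
	var files, used, allocated int64
	filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		files++
		used += info.Size()
		// every file occupies a whole number of clusters on disk
		allocated += ((info.Size() + clusterSize - 1) / clusterSize) * clusterSize
		return nil
	})
	if allocated == 0 {
		fmt.Println("no files found")
		return
	}
	fmt.Printf("files: %d, data: %d bytes, allocated: %d bytes, slack: %.1f%%\n",
		files, used, allocated, 100*float64(allocated-used)/float64(allocated))
}
```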

Not to mention fragmentation and slower access speed (for reading, writing, deleting) compared to having one large file.

The sheer number of separate files is also a problem in itself, as the number of files a filesystem can hold is limited.

I see that the mainnet/balances folder contains larger files, each 2 GB. I don't know what these files are (wallet data?), but that is a much more effective approach. There isn't 1 file per wallet there either.

Possible solution: don't save each block/transaction as a separate file, but combine multiple blocks into a single file. How many blocks to combine could even be defined by a constant in the code, so an advanced user could adjust it to their needs. Or it could be made dynamic, combining fewer blocks when they contain more transactions.

E.g. combine every 1000 blocks into a single file. Right now Dero has 734252 blocks, so that would be about 735 files. The transaction data is currently around 16 GB, so each file would be around 22 MB, which is a nice size even for sending over the network. And a file holding 1000 empty blocks would be only about 100 kB.

Everything would run more smoothly with fewer problems, while keeping the ability to easily rsync the data, and copying the files would be much faster.
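To make the idea a bit more concrete, here is a minimal, hypothetical Go sketch of the "N blocks per file" layout. BlocksPerFile, the file naming, and the length-prefix framing are all illustrative choices, not Dero's actual storage format:

```go
// Sketch of grouping many blocks into one append-only chunk file.
package chunkstore

import (
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"
)

// BlocksPerFile is the tunable constant suggested above.
const BlocksPerFile = 1000

// chunkPath maps a block height to the chunk file that would contain it,
// e.g. height 734252 with 1000 blocks per file -> "chunk_000734.dat".
func chunkPath(dataDir string, height uint64) string {
	return filepath.Join(dataDir, fmt.Sprintf("chunk_%06d.dat", height/BlocksPerFile))
}

// AppendBlock appends a length-prefixed serialized block to its chunk file.
// A real implementation would also keep an index of (height -> file offset)
// so individual blocks can be read back without scanning the whole chunk.
func AppendBlock(dataDir string, height uint64, serialized []byte) error {
	f, err := os.OpenFile(chunkPath(dataDir, height),
		os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	var prefix [4]byte
	binary.LittleEndian.PutUint32(prefix[:], uint32(len(serialized)))
	if _, err := f.Write(prefix[:]); err != nil {
		return err
	}
	_, err = f.Write(serialized)
	return err
}
```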

@gab81

gab81 commented Dec 16, 2022

Yes, and in addition here's a scenario I recently encountered: I moved Dero's CLI to another drive for backup (can happen, right?) and it took FOREVER - 4.6 million files in total, no joke - with my 16 GB computer eating up memory; I had to keep freeing it up during the copy, but it worked. Then Windows started indexing all of those files as well, creating a massive index database file that I had to delete later. I totally agree with what Robyer suggested and hope something is done about this to make it more efficient :)

thanks,
Gab

@lcances

lcances commented May 8, 2024

Adding my 2 cents to this (I know it is old; this is also a note for my to-do list).

I would like to propose Hierarchical Data Format version 5 (HDF5).
It is commonly used to store large datasets, and each HDF file can be seen as a local key : value database.

Each HDF file could be a collection of 10 000 blocks, which would result in 368 files of 0.29 GB (as of today), and the advantages are multiple:

  • Removing the overhead of moving / downloading millions of files, which would most likely reduce a full sync from days to hours
  • fast-sync would be very fast, as only the last two HDF files would need to be downloaded (0.6 GB)
  • Reading the history would also be much faster, and dero-explorer would greatly benefit from it
  • HDF supports transparent compression on read and write, so we could most likely reduce the size of the chain and accelerate synchronisation even further

The implementation is not that hard, but testing and ensuring reliability could take some time. And during the entire process both the old files and the new files would need to coexist (doubling the required storage space).

Steps for implementation could be (there are most likely other ways to do it, but this is how I would do it; a rough sketch of step 1 follows after the list):

  1. Refactor the current tx / block IO mechanism to use an interface
  2. Implement the HDF solution using the exact same interface
  3. Add a parameter to start the node using either HDF or single files
  4. Create a tool to copy the existing files into HDF files
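For step 1, something like the following could serve as the shared interface that both backends implement. This is purely illustrative; none of these names come from the Dero codebase:

```go
// Sketch of a storage interface so per-file and HDF5-backed layouts can be swapped.
package storage

import "errors"

// BlockStore abstracts how serialized blocks and transactions are persisted.
type BlockStore interface {
	StoreBlock(height uint64, serialized []byte) error
	LoadBlock(height uint64) ([]byte, error)
	StoreTX(txid [32]byte, serialized []byte) error
	LoadTX(txid [32]byte) ([]byte, error)
	Close() error
}

// NewBlockStore would select the backend from a startup parameter (step 3).
func NewBlockStore(dataDir string, useHDF bool) (BlockStore, error) {
	if useHDF {
		return nil, errors.New("HDF5-backed store not implemented yet (step 2)")
	}
	return nil, errors.New("wrap the existing one-file-per-block store here (step 1)")
}
```

With this in place, the migration tool from step 4 would just read from one BlockStore and write into the other.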
