Optimize storage of the blockchain data #87

Open
Robyer opened this issue Aug 3, 2022 · 2 comments
Comments

@Robyer

Robyer commented Aug 3, 2022

I think there are 2 main problems with the blockchain data storage right now:

  1. an enormous number of files
  2. the extremely small size of these files

A filesystem has a cluster size (the smallest unit of space that can be allocated on disk), and if a file is smaller than that, the remaining space in its cluster is wasted because it can't be used for anything else. For example, NTFS has a default cluster size of 4 KB. Most of the files in the Dero mainnet folder are only 101 bytes (I think that is a single transaction, or a single block with no transactions in it). That means every such file Dero saves wastes 3995 bytes of its 4 KB cluster, i.e. 97.5 % of the space allocated for these files.
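For illustration, here is a rough Go sketch (not part of Dero; the 4 KB cluster size is just an assumption, and real allocation depends on the filesystem) that walks a data directory and estimates how much space is lost to cluster slack:

```go
// Hypothetical diagnostic: estimate cluster-slack waste in a data directory.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

const clusterSize int64 = 4096 // assumed NTFS default cluster size

func main() {
	dir := os.Args[1] // e.g. the mainnet data folder
	var files, used, allocated int64
	filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		files++
		used += info.Size()
		// every file occupies a whole number of clusters on disk
		allocated += ((info.Size() + clusterSize - 1) / clusterSize) * clusterSize
		return nil
	})
	if allocated == 0 {
		fmt.Println("no files found")
		return
	}
	fmt.Printf("files: %d, data: %d bytes, allocated: %d bytes, slack: %.1f%%\n",
		files, used, allocated, 100*float64(allocated-used)/float64(allocated))
}
```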

Not to mention fragmentation and slower access speed (for reading, writing, deleting) compared to having one large file.

The sheer number of separate files is also a problem in itself, as the number of files a filesystem can hold is limited.

I see that the mainnet/balances folder contains larger files, each 2 GB. I don't know what these files are (wallet data?), but that is a much more effective approach. There isn't 1 file per wallet there either.

Possible solution: don't save each block/transaction as a separate file, but combine multiple blocks into a single file. How many blocks to combine could even be defined by a constant in the code, so an advanced user could adjust it to their needs. Or it could be made dynamic, combining fewer blocks when they contain more transactions.

E.g. combine every 1000 blocks into a single file. Right now Dero has 734252 blocks, so that would be about 735 files. The transaction data is currently around 16 GB, so each file would be around 22 MB, which is a nice size even for sending over the network. And a file holding 1000 empty blocks would be only about 100 kB.

Everything would run more smoothly with fewer problems, while keeping the ability to easily rsync the data, and copying the files would be much faster.
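To make the idea a bit more concrete, here is a minimal, hypothetical Go sketch of the "N blocks per file" layout. BlocksPerFile, the file naming, and the length-prefix framing are all illustrative choices, not Dero's actual storage format:

```go
// Sketch of grouping many blocks into one append-only chunk file.
package chunkstore

import (
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"
)

// BlocksPerFile is the tunable constant suggested above.
const BlocksPerFile = 1000

// chunkPath maps a block height to the chunk file that would contain it,
// e.g. height 734252 with 1000 blocks per file -> "chunk_000734.dat".
func chunkPath(dataDir string, height uint64) string {
	return filepath.Join(dataDir, fmt.Sprintf("chunk_%06d.dat", height/BlocksPerFile))
}

// AppendBlock appends a length-prefixed serialized block to its chunk file.
// A real implementation would also keep an index of (height -> file offset)
// so individual blocks can be read back without scanning the whole chunk.
func AppendBlock(dataDir string, height uint64, serialized []byte) error {
	f, err := os.OpenFile(chunkPath(dataDir, height),
		os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	var prefix [4]byte
	binary.LittleEndian.PutUint32(prefix[:], uint32(len(serialized)))
	if _, err := f.Write(prefix[:]); err != nil {
		return err
	}
	_, err = f.Write(serialized)
	return err
}
```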

@gab81

gab81 commented Dec 16, 2022

Yes, and in addition here's a scenario I recently encountered: I moved Dero's CLI to another drive for backup (can happen, right?) and it took FOREVER - 4.6 million files in total, no joke - with my 16 GB computer eating up memory; I had to keep freeing it up during the copy, but it worked. Then Windows started indexing all of those files as well, creating a massive index database file that I had to delete later. I totally agree with what Robyer suggested and hope something is done about this to make it more efficient :)

thanks,
Gab

@lcances

lcances commented May 8, 2024

Adding my 2 cents to this (I know it is old; this is also a note for my to-do list).

I would like to propose Hierarchical Data Format version 5 (HDF5).
It is commonly used to store large datasets, and each HDF file can be seen as a local key : value database.

Each HDF file could be a collection of 10 000 blocks, which would result in 368 files of 0.29 GB (as of today), and the advantages are multiple:

  • Removing the overhead of moving / downloading millions of files, which would most likely reduce a full sync from days to hours
  • fast-sync would be very fast, as only the last two HDF files would need to be downloaded (0.6 GB)
  • Reading the history would also be much faster, and dero-explorer would greatly benefit from it
  • HDF supports transparent compression on read and write, so we could most likely reduce the size of the chain and accelerate synchronisation even further

The implementation is not that hard, but testing and ensuring reliability could take some time. And during the entire process both the old files and the new files would need to coexist (doubling the required storage space).

Steps for implementation could be (there are most likely other ways to do it, but this is how I would do it; a rough sketch of step 1 follows after the list):

  1. Refactor the current tx / block IO mechanism to use an interface
  2. Implement the HDF solution using the exact same interface
  3. Add a parameter to start the node using either HDF or single files
  4. Create a tool to copy the existing files into HDF files
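For step 1, something like the following could serve as the shared interface that both backends implement. This is purely illustrative; none of these names come from the Dero codebase:

```go
// Sketch of a storage interface so per-file and HDF5-backed layouts can be swapped.
package storage

import "errors"

// BlockStore abstracts how serialized blocks and transactions are persisted.
type BlockStore interface {
	StoreBlock(height uint64, serialized []byte) error
	LoadBlock(height uint64) ([]byte, error)
	StoreTX(txid [32]byte, serialized []byte) error
	LoadTX(txid [32]byte) ([]byte, error)
	Close() error
}

// NewBlockStore would select the backend from a startup parameter (step 3).
func NewBlockStore(dataDir string, useHDF bool) (BlockStore, error) {
	if useHDF {
		return nil, errors.New("HDF5-backed store not implemented yet (step 2)")
	}
	return nil, errors.New("wrap the existing one-file-per-block store here (step 1)")
}
```

With this in place, the migration tool from step 4 would just read from one BlockStore and write into the other.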
