-
I'm trying to understand this page of the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multipreprocessing.html

It shows an example that runs preprocess_data.py on a large dataset, presumably "train_large.xyz". The example appears to contain a typo: the last line is "continued", and preprocess_data.py is invoked a second time on the same line. Is this supposed to be a second call to preprocess_data.py, or should the second invocation be ignored? When I ignore the second invocation, it produces the following files (note: I do not provide a test set):
The documentation continues by saying:
However, when I run the command, I receive the error:
Was preprocess_data.py supposed to generate ./processed_data/train.h5? Or did the documentation intend processed_data/train/train_0.h5? If so, doesn't that simply ignore 3/4 of the training data?
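For reference, the preprocessing step I ran follows the doc's example, roughly like the sketch below (the flag values here are illustrative, not the doc's exact ones; with --num_process=4 the script writes four shards, train_0.h5 through train_3.h5, which would explain the files I see):

```sh
# Sketch of the preprocessing step; flag names follow preprocess_data.py,
# values are illustrative. --num_process controls how many train_*.h5
# shards are written under processed_data/train/.
python ./mace/scripts/preprocess_data.py \
    --train_file="train_large.xyz" \
    --valid_fraction=0.05 \
    --h5_prefix="processed_data/" \
    --compute_statistics \
    --E0s="average" \
    --r_max=5.0 \
    --seed=123 \
    --num_process=4
```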
-
Hey @tjgiese, thank you for catching this typo in the doc; I have fixed it and also clarified the page. If you used multiple processes for the preprocessing (which is your case), you need to train with a command along the following lines:
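Something like this sketch (assuming the standard run_train.py options; check run_train.py --help for your version, and match the directory names to whatever preprocess_data.py actually wrote under processed_data/):

```sh
# Sketch: point --train_file at the folder of h5 shards, not a single file.
python ./mace/scripts/run_train.py \
    --name="MACE_model" \
    --train_file="./processed_data/train" \
    --valid_file="./processed_data/val" \
    --statistics_file="./processed_data/statistics.json" \
    --num_workers=4 \
    --device="cuda"
```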
You pass the full folder path.
-
Hi @ilyes319, if I understand correctly, in the current preprocessing workflow we need to pack all data into a single giant XYZ file before converting it into H5 files. In some scenarios (e.g., datasets of hundreds of GB), this might not be practical. To address this, the load_from_xyz function could be adapted to handle directories containing multiple XYZ files, along the lines of the sketch below.
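A rough sketch of that idea (presumably built on ASE's ase.io.read; the real load_from_xyz returns MACE Configuration objects with energies and forces attached, not raw ASE Atoms, so this only illustrates the directory handling):

```python
import glob
import os

import ase.io


def load_from_xyz_path(path: str) -> list:
    """Read configurations from one XYZ file or a directory of XYZ files.

    A sketch only: the real load_from_xyz returns MACE Configuration
    objects, not raw ASE Atoms.
    """
    if os.path.isdir(path):
        # Hypothetical extension: gather every .xyz file in the directory.
        files = sorted(glob.glob(os.path.join(path, "*.xyz")))
    else:
        files = [path]
    atoms_list = []
    for filename in files:
        # index=":" reads all frames from each file.
        atoms_list.extend(ase.io.read(filename, index=":"))
    return atoms_list
```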
Could you please advise what you recommend for managing such large datasets efficiently during training in a single-node or multi-node, multi-GPU setup? For example, how should the number of splits relate to the number of GPUs, tasks, and batch size? Thank you for your assistance and the great work on MACE.