-
I'm trying to understand this page of the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multipreprocessing.html

It shows an example that runs preprocess_data.py on a large dataset, presumably "train_large.xyz". The example appears to contain a typo: the last line is "continued", and preprocess_data.py is invoked a second time on the same line. Is this supposed to be a second call to preprocess_data.py, or should the second invocation be ignored? When I ignore the second invocation, it produces the following files (note: I do not provide a test set):
The documentation continues by saying:
However, when I run the command, I receive the error:
Was preprocess_data.py supposed to generate ./processed_data/train.h5? Or did the documentation intend processed_data/train/train_0.h5? If so, doesn't that simply ignore 3/4 of the training data?
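For reference, the preprocessing step I ran follows the doc's example, roughly like the sketch below (the flag values here are illustrative, not the doc's exact ones; with --num_process=4 the script writes four shards, train_0.h5 through train_3.h5, which would explain the files I see):

```sh
# Sketch of the preprocessing step; flag names follow preprocess_data.py,
# values are illustrative. --num_process controls how many train_*.h5
# shards are written under processed_data/train/.
python ./mace/scripts/preprocess_data.py \
    --train_file="train_large.xyz" \
    --valid_fraction=0.05 \
    --h5_prefix="processed_data/" \
    --compute_statistics \
    --E0s="average" \
    --r_max=5.0 \
    --seed=123 \
    --num_process=4
```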
-
Hey @tjgiese, thank you for catching this typo in the doc; I have fixed it and also clarified the page. If you used multiple processes for the preprocessing (which is your case), you need to train with a command along the following lines:
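Something like this sketch (assuming the standard run_train.py options; check run_train.py --help for your version, and match the directory names to whatever preprocess_data.py actually wrote under processed_data/):

```sh
# Sketch: point --train_file at the folder of h5 shards, not a single file.
python ./mace/scripts/run_train.py \
    --name="MACE_model" \
    --train_file="./processed_data/train" \
    --valid_file="./processed_data/val" \
    --statistics_file="./processed_data/statistics.json" \
    --num_workers=4 \
    --device="cuda"
```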
You pass the full folder path.
-
Hi @ilyes319, if I understand correctly, in the current preprocessing workflow we need to pack all data into a single giant XYZ file before converting it into H5 files. In some scenarios (e.g., datasets of hundreds of GB), this might not be practical. To address this, the load_from_xyz function could be adapted to handle directories containing multiple XYZ files, along the lines of the sketch below.
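A rough sketch of that idea (presumably built on ASE's ase.io.read; the real load_from_xyz returns MACE Configuration objects with energies and forces attached, not raw ASE Atoms, so this only illustrates the directory handling):

```python
import glob
import os

import ase.io


def load_from_xyz_path(path: str) -> list:
    """Read configurations from one XYZ file or a directory of XYZ files.

    A sketch only: the real load_from_xyz returns MACE Configuration
    objects, not raw ASE Atoms.
    """
    if os.path.isdir(path):
        # Hypothetical extension: gather every .xyz file in the directory.
        files = sorted(glob.glob(os.path.join(path, "*.xyz")))
    else:
        files = [path]
    atoms_list = []
    for filename in files:
        # index=":" reads all frames from each file.
        atoms_list.extend(ase.io.read(filename, index=":"))
    return atoms_list
```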
Could you please advise what you recommend for managing such large datasets efficiently during training in a single-node or multi-node, multi-GPU setup? For example, how should the number of splits relate to the number of GPUs, tasks, and batch size? Thank you for your assistance and the great work on MACE.