Question about LAMMPS MACE MD simulation using multiple GPU nodes. #498

turbosonics · 2024-07-01T19:57:24Z


I posted #487 in the issue tracker about a CUDA out-of-memory crash with a 'huge' geometry, i.e. a system with more than 10k atoms. Following recommendations from others, I'm fine-tuning an MP-0 'small' model with r_max 4 to see whether that helps with the problem.

At the same time, I wonder whether it would be possible to run such "huge" geometries with the MACE MP-0 model in our local cluster environment.

Our local cluster has only one GPU per GPU node, and each GPU has 40 GB of memory. With this setup, I can't run systems that large with the MACE MP-0 model in LAMMPS on a single GPU node, so I hope to use multiple GPU nodes.

We are using Slurm, so I tried

#SBATCH --nodes=2
#SBATCH --gres=gpu:1
#SBATCH --exclusive

and executed using
lmp -k on g 2 -sf kk -in MACE.input

But I still see exactly the same CUDA OOM crash as in the single-GPU-node case. It seems that LAMMPS-MACE does not recognize and use the memory provided by multiple GPU nodes; instead, it still uses the memory of a single GPU node.

Could LAMMPS-MACE utilize more memory resources from multiple GPU nodes?

wcwitt · 2024-07-01T20:02:42Z

Maintainer

It is possible to do what you want, but note that our documentation says

At present, only single-GPU evaluation is recommended.

The reason is that multi-GPU will likely be much slower (for now). If, hearing this, you still want to try for memory reasons, you will need something like

mpirun -np 2 lmp -k on g 2 -sf kk -in MACE.input
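As a rough sketch of how the rank count and the Kokkos "g" value relate: in the LAMMPS Kokkos package, "g" is the number of GPUs per node rather than the total for the job, and the usual layout is one MPI rank per GPU. The variable names NODES and GPUS_PER_NODE below are illustrative, and the values assume a cluster with one GPU on each of two nodes, so adjust them to your hardware. The snippet only prints the resulting command; it does not launch LAMMPS.

```shell
#!/usr/bin/env bash
# Sketch only: NODES and GPUS_PER_NODE are illustrative variable names,
# set here for an assumed cluster with one GPU on each of two nodes.
NODES=2
GPUS_PER_NODE=1

# One MPI rank per GPU across the whole job.
RANKS=$((NODES * GPUS_PER_NODE))

# In the Kokkos package, "g" is the number of GPUs per node, not the job total.
CMD="mpirun -np ${RANKS} lmp -k on g ${GPUS_PER_NODE} -sf kk -in MACE.input"
echo "${CMD}"
```

On a single node with two GPUs, setting NODES=1 and GPUS_PER_NODE=2 reproduces the command shown above.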

5 replies

turbosonics Jul 1, 2024
Author

Thanks, let me recompile LAMMPS-MACE with Open MPI on our GPU cluster and try again.

turbosonics Jul 2, 2024
Author

I compiled LAMMPS-MACE with Open MPI. The build itself didn't cause any problems, but when I try to run lmp with mpirun, I get RuntimeError: CUDA error: device-side assert triggered. Here are the modules I loaded:
gcc/11.2.0
intel/2019
cuda11.8
cudnn/8.1.1.33-11.2-gcc-milan-a100
openmpi/4.1.1-gcc-milan-a100
cmake/3.21.4-gcc-milan-a100
git/2.31.1-gcc-milan-a100
The libtorch build was libtorch-shared-with-deps-2.3.1+cu118, since we use CUDA 11.8. But I really don't know how to solve this issue. Am I loading the wrong modules?

Do you have a CMake preset file for LAMMPS-MACE, just in case?

wcwitt Jul 2, 2024
Maintainer

Sorry, all I can really say is that the recipe in our docs works for our A100 cluster. It can be challenging to get Kokkos/CUDA/MPI all talking to one another happily - do you have a system administrator who could help on your system?

turbosonics Jul 2, 2024
Author

It is strange. Compiling and building LAMMPS-MACE didn't fail; the crash happens only when I submit the job.

Here's another strange thing:

Even with mpirun, if I submit the job on a single node with a 1000-atom system (which also works without mpirun) using "mpirun -n 1 lmp -k on g 1 -sf kk -in MACE.input", the MD simulation works.
But if I submit the job on 2 GPU nodes with the same 1000-atom system using "mpirun -n 2 lmp -k on g 2 -sf kk -in MACE.input", it fails with "RuntimeError: CUDA error: device-side assert triggered".

I googled this error, and it looks like it may also be related to CUDA running out of memory:
https://stackoverflow.com/questions/78471663/what-do-these-torch-use-cuda-dsa-and-frozen-modules-errors-mean-and-how-to-fix-t
https://stackoverflow.com/questions/55780923/what-does-runtimeerror-cuda-error-device-side-assert-triggered-in-pytorch-me

I think this may be caused by our cluster's architecture and lack of memory, but I wonder if LAMMPS-MACE can be improved further.

Have you tried LAMMPS-MACE MD simulations on 2 or 4 GPU nodes using mpirun, as I did, for a "big" system with a large number of atoms?

wcwitt Jul 3, 2024
Maintainer

Have you guys tried the LAMMPS-MACE MD simulation using 2 or 4 GPU nodes using mpirun like I tried, for "big" system with many number of atoms?

Yes, and it works, but there are some complications we are still working on. This is why we say

At present, only single-GPU evaluation is recommended.

Have you removed the no_domain_decomposition tag when attempting multi-GPU?
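A quick way to check whether the tag is still present is to search the input file. This is a minimal sketch: has_tag is a hypothetical helper name (not part of LAMMPS or MACE), MACE.input is the file name used earlier in this thread, and a plain text search will also match a commented-out line.

```shell
#!/usr/bin/env bash
# has_tag is a hypothetical helper name, not part of LAMMPS or MACE.
# It succeeds (exit status 0) if the given file mentions the tag anywhere,
# including inside a comment.
has_tag() {
    grep -q "no_domain_decomposition" "$1"
}

INPUT="MACE.input"   # input file name used in this thread
if [ -f "${INPUT}" ] && has_tag "${INPUT}"; then
    echo "${INPUT} still contains no_domain_decomposition"
else
    echo "${INPUT} does not contain no_domain_decomposition (or is missing)"
fi
```

If the first message is printed, the tag would need to be removed from the input file before a multi-GPU run.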
