A PyTorch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead. Simple, yet powerful, API.
- Simple, yet powerful, API
- Easy initialization of `torch.distributed` (see the sketch after this list)
- Distributed metrics
- Extensive logging and diagnostics
- Wandb support
- Tensorboard support
- A wealth of useful utility functions
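To make the first bullet concrete, here is a minimal sketch of the raw `torch.distributed` boilerplate that an initialization helper replaces. It assumes a launcher such as `torchrun` has exported `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT`; it is not dmlcloud's API, which is covered in the documentation.

```python
# Illustration only: the manual torch.distributed setup that an
# initialization helper abstracts away. This is NOT dmlcloud's API.
import os

import torch
import torch.distributed as dist


def init_from_env():
    """Initialize the default process group from launcher-provided
    environment variables (RANK, WORLD_SIZE, LOCAL_RANK)."""
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # Pin this process to its GPU so collectives use the right device.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)

    return rank, world_size, local_rank
```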
dmlcloud can be installed directly from PyPI:

```bash
pip install dmlcloud
```
Alternatively, you can install the latest development version directly from GitHub:

```bash
pip install git+https://github.com/sehoffmann/dmlcloud.git
```
See `examples/mnist.py` for a minimal example of how to train MNIST with multiple GPUs. To run it with 4 GPUs, use:

```bash
dmlrun -n 4 python examples/mnist.py
```
`dmlrun` is a thin wrapper around `torchrun` that makes it easier to prototype on a single node.
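Since `dmlrun` wraps `torchrun`, a roughly equivalent plain `torchrun` invocation for a single node would be the following (standard `torchrun` flags, nothing dmlcloud-specific):

```bash
torchrun --standalone --nproc_per_node=4 examples/mnist.py
```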
dmlcloud automatically looks for Slurm environment variables to initialize `torch.distributed`. On a Slurm cluster, you can therefore simply use `srun` from within an sbatch script to train on multiple nodes:
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

srun python examples/mnist.py
```
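Submit the script above with `sbatch`. Purely as an illustration of what "looking for Slurm environment variables" means, the rank layout each `srun` task sees can be derived like this (these are standard Slurm variables, not dmlcloud internals):

```python
# Illustrative sketch: deriving a rank layout from the variables Slurm sets
# for every srun task. dmlcloud performs this detection automatically.
import os


def slurm_rank_layout():
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node
    # A full setup would also derive MASTER_ADDR/MASTER_PORT,
    # e.g. from SLURM_NODELIST, before calling init_process_group().
    return rank, world_size, local_rank
```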
You can find the official documentation at Read the Docs.