dmlcloud

A torch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead; a simple, yet powerful, API.

Highlights

  • Simple, yet powerful, API
  • Easy initialization of torch.distributed
  • Distributed metrics (see the sketch below)
  • Extensive logging and diagnostics
  • Weights & Biases (wandb) support
  • TensorBoard support
  • A wealth of useful utility functions
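To make "distributed metrics" concrete: aggregating a metric across ranks boils down to a collective reduction. Below is a minimal sketch in plain torch.distributed of the kind of operation dmlcloud's metric utilities wrap for you; the function global_mean is hypothetical and not part of dmlcloud's API.

import torch
import torch.distributed as dist

def global_mean(value: torch.Tensor) -> torch.Tensor:
    # Average a scalar metric across all ranks.
    # Requires an initialized process group (see the Slurm section below).
    value = value.clone()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()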

Installation

dmlcloud can be installed directly from PyPI:

pip install dmlcloud

Alternatively, you can install the latest development version directly from GitHub:

pip install git+https://github.com/sehoffmann/dmlcloud.git

Minimal Example

See examples/mnist.py for a minimal example of how to train MNIST with multiple GPUs. To run it with 4 GPUs, use

dmlrun -n 4 python examples/mnist.py

dmlrun is a thin wrapper around torchrun that makes it easier to prototype on a single node.
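Since dmlrun wraps torchrun, the command above is roughly equivalent to the following single-node torchrun invocation (dmlrun may set additional defaults on top of this):

torchrun --standalone --nproc_per_node=4 examples/mnist.py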

Slurm Support

dmlcloud automatically looks for Slurm environment variables to initialize torch.distributed. On a Slurm cluster, you can therefore simply use srun from within an sbatch script to train on multiple nodes:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

srun python examples/mnist.py
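For illustration, the snippet below shows roughly what initializing torch.distributed from Slurm environment variables looks like in plain torch. dmlcloud performs this kind of setup for you; its actual internals may differ, and the rendezvous-address handling here is a simplifying assumption.

import os
import torch
import torch.distributed as dist

# srun sets these variables for every task it launches.
rank = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ['SLURM_LOCALID'])

# torch.distributed also needs a rendezvous point; we assume
# MASTER_ADDR and MASTER_PORT were exported in the sbatch script.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')

dist.init_process_group('nccl', rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)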

Documentation

You can find the official documentation at Read the Docs.
