A PyTorch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead.
- Simple, yet powerful, API
- Easy initialization of `torch.distributed`
- Distributed metrics
- Extensive logging and diagnostics
- Weights & Biases (wandb) support
- TensorBoard support
- A wealth of useful utility functions
dmlcloud can be installed directly from PyPI:
```bash
pip install dmlcloud
```
Alternatively, you can install the latest development version directly from GitHub:
```bash
pip install git+https://github.com/sehoffmann/dmlcloud.git
```
You can find the official documentation on Read the Docs.
See `examples/mnist.py` for a minimal example of how to train MNIST with multiple GPUs. To run it with 4 GPUs, use:
```bash
dmlrun -n 4 python examples/mnist.py
```
`dmlrun` is a thin wrapper around `torchrun` that makes it easier to prototype on a single node.
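For comparison, a roughly equivalent single-node launch with plain `torchrun` might look as follows. The exact mapping of `dmlrun -n` onto torchrun flags is an assumption here; `--standalone` and `--nproc_per_node` are standard torchrun options:

```bash
# Roughly equivalent single-node launch with plain torchrun (assumed mapping):
torchrun --standalone --nproc_per_node=4 examples/mnist.py
```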
dmlcloud automatically looks for Slurm environment variables to initialize `torch.distributed`. On a Slurm cluster, you can therefore simply use `srun` from within an sbatch script to train on multiple nodes:
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

srun python examples/mnist.py
```
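Under the hood, initializing from Slurm essentially means reading the standard Slurm environment variables and passing them to `torch.distributed.init_process_group`. The following is a minimal sketch of that mapping, assuming the default `env://` rendezvous; it is not dmlcloud's actual implementation:

```python
import os
import torch.distributed as dist

# Standard variables that Slurm sets for every task launched via srun.
rank = int(os.environ["SLURM_PROCID"])         # global rank of this process
world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the local node

# With the default env:// rendezvous, MASTER_ADDR and MASTER_PORT must also
# be set, e.g. derived from SLURM_JOB_NODELIST before this point.
dist.init_process_group("nccl", rank=rank, world_size=world_size)
```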
dmlcloud was designed around a single guiding principle:

*No unnecessary abstractions, just help with distributed training.*
As a consequence, dmlcloud code is almost identical to a regular PyTorch training loop and only requires a few adjustments here and there. In contrast, other libraries often introduce extensive APIs that can quickly feel overwhelming due to their sheer number of options.
For instance, the constructor of `lightning.Trainer` has 51 arguments! `dml.Pipeline` has only 2.
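To make that comparison concrete, here is the shape of a regular PyTorch DDP training loop, the baseline that dmlcloud stays close to. This sketch uses only plain PyTorch APIs, not dmlcloud's; dmlcloud layers initialization, metrics, and logging on top of a loop like this:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, epochs):
    # LOCAL_RANK is set by torchrun for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            opt.step()
```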