dmlcloud

A torch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead; a simple, yet powerful, API.

Highlights

  • Simple, yet powerful, API
  • Easy initialization of torch.distributed
  • Distributed metrics (see the sketch below)
  • Extensive logging and diagnostics
  • Weights & Biases (wandb) support
  • TensorBoard support
  • A wealth of useful utility functions
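To make "distributed metrics" concrete: aggregating a metric across ranks boils down to a collective reduction. Below is a minimal sketch in plain torch.distributed of the kind of operation dmlcloud's metric utilities wrap for you; the function global_mean is hypothetical and not part of dmlcloud's API.

import torch
import torch.distributed as dist

def global_mean(value: torch.Tensor) -> torch.Tensor:
    # Average a scalar metric across all ranks.
    # Requires an initialized process group (see the Slurm section below).
    value = value.clone()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()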

Installation

dmlcloud can be installed directly from PyPI:

pip install dmlcloud

Alternatively, you can install the latest development version directly from GitHub:

pip install git+https://github.com/sehoffmann/dmlcloud.git

Minimal Example

See examples/mnist.py for a minimal example of how to train MNIST with multiple GPUs. To run it with 4 GPUs, use

dmlrun -n 4 python examples/mnist.py

dmlrun is a thin wrapper around torchrun that makes it easier to prototype on a single node.
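Since dmlrun wraps torchrun, the command above is roughly equivalent to the following single-node torchrun invocation (dmlrun may set additional defaults on top of this):

torchrun --standalone --nproc_per_node=4 examples/mnist.py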

Slurm Support

dmlcloud automatically looks for Slurm environment variables to initialize torch.distributed. On a Slurm cluster, you can therefore simply use srun from within an sbatch script to train on multiple nodes:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

srun python examples/mnist.py
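For illustration, the snippet below shows roughly what initializing torch.distributed from Slurm environment variables looks like in plain torch. dmlcloud performs this kind of setup for you; its actual internals may differ, and the rendezvous-address handling here is a simplifying assumption.

import os
import torch
import torch.distributed as dist

# srun sets these variables for every task it launches.
rank = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])
local_rank = int(os.environ['SLURM_LOCALID'])

# torch.distributed also needs a rendezvous point; we assume
# MASTER_ADDR and MASTER_PORT were exported in the sbatch script.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')

dist.init_process_group('nccl', rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)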

Documentation

You can find the official documentation at Read the Docs.
