Add Submission Process Rules #476

Merged
merged 19 commits into from Nov 3, 2023
Changes from 1 commit
Merge branch 'dev' into CfS
fsschneider committed Aug 11, 2023
commit 06238ae3bacce7383acfe0d354b7a3648bf303e1
60 changes: 30 additions & 30 deletions GETTING_STARTED.md
@@ -1,30 +1,28 @@
# Getting Started

Table of Contents:

- [Getting Started](#getting-started)
-- [Workspace set up and installation](#workspace-set-up-and-installation)
-- [Download the data](#download-the-data)
-- [Develop your submission](#develop-your-submission)
-- [Set up your directory structure (Optional)](#set-up-your-directory-structure-optional)
-- [Coding your submission](#coding-your-submission)
-- [Run your submission](#run-your-submission)
-- [Pytorch DDP](#pytorch-ddp)
-- [Run your submission in a Docker container](#run-your-submission-in-a-docker-container)
-- [Docker Tips](#docker-tips)
-- [Score your submission](#score-your-submission)
-- [Good Luck](#good-luck)
-
-## Workspace set up and installation
+- [Set up and installation](#set-up-and-installation)
+- [Download the data](#download-the-data)
+- [Develop your submission](#develop-your-submission)
+- [Set up your directory structure (Optional)](#set-up-your-directory-structure-optional)
+- [Coding your submission](#coding-your-submission)
+- [Run your submission](#run-your-submission)
+- [Pytorch DDP](#pytorch-ddp)
+- [Run your submission in a Docker container](#run-your-submission-in-a-docker-container)
+- [Docker Tips](#docker-tips)
+- [Score your submission](#score-your-submission)
+- [Good Luck](#good-luck)
+
+## Set up and installation

To get started, you will have to make a few decisions and install the repository along with its dependencies. Specifically:

1. Decide whether you would like to develop your submission in PyTorch or JAX.
2. Set up your workstation or VM. We recommend using a setup similar to the [benchmarking hardware](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#benchmarking-hardware).
   The specs on the benchmarking machines are:
   - 8 V100 GPUs
   - 240 GB of RAM
   - 2 TB of storage (for datasets).

3. Install the algorithmic package and its dependencies; see [Installation](./README.md#installation).

## Download the data
@@ -102,7 +100,7 @@ python3 submission_runner.py \
--tuning_search_space=<path_to_tuning_search_space>
```

-#### Pytorch DDP
+### Pytorch DDP

We recommend using PyTorch's [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
when using multiple GPUs on a single node. You can initialize DDP with `torchrun`.
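
A minimal sketch of such a launch on a single 8-GPU node (flags abridged, placeholders in angle brackets; it mirrors the full example shown later in the README):

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  submission_runner.py \
  --framework=pytorch \
  --workload=<workload> \
  --submission_path=<path_to_submission_module> \
  --tuning_search_space=<path_to_tuning_search_space>
```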
@@ -132,14 +130,14 @@ torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \

The container entrypoint script provides the following flags:

-- `-d` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/<dataset>` does not exist on the host machine. Required for running a submission.
-- `-f` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
-- `-s` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
-- `-t` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
-- `-e` experiment_name: name of experiment. Required for running a submission.
-- `-w` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
-- `-m` max_steps: maximum number of steps to run the workload for. Optional.
-- `-b` debugging_mode: can be true or false. If `-b` (debugging_mode) is `true` the main process on the container will persist.
+- `--dataset` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/<dataset>` does not exist on the host machine. Required for running a submission.
+- `--framework` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `--dataset imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
+- `--submission_path` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
+- `--tuning_search_space` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
+- `--experiment_name` experiment_name: name of experiment. Required for running a submission.
+- `--workload` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
+- `--max_global_steps` max_global_steps: maximum number of steps to run the workload for. Optional.
+- `--keep_container_alive`: can be true or false. If `true` the container will not be killed automatically. This is useful for developing or debugging.
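
As an illustration only (the image name, the data mount, and all values below are placeholders, not prescriptive), a submission run using the long-form entrypoint flags might look like:

```bash
docker run -t -d \
  --gpus all \
  -v $HOME/data/:/data/ \
  <docker_image_name> \
  --dataset ogbg \
  --framework jax \
  --submission_path <path_to_submission_module> \
  --tuning_search_space <path_to_tuning_search_space> \
  --experiment_name my_experiment \
  --workload ogbg \
  --keep_container_alive false
```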

To start the Docker container that will run the submission runner:

@@ -162,23 +160,23 @@ docker run -t -d \

This will print the container ID to the terminal.

-#### Docker Tips ####
+#### Docker Tips

To find the container IDs of running containers

-```
+```bash
docker ps
```

To see output of the entrypoint script

-```
+```bash
docker logs <container_id>
```

To enter a bash session in the container

-```
+```bash
docker exec -it <container_id> /bin/bash
```
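
And to stop the container when you are done (standard Docker, not specific to this repository's tooling):

```bash
docker stop <container_id>
```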

@@ -190,4 +188,6 @@ To produce performance profile and performance table:
python3 scoring/score_submission.py --experiment_path=<path_to_experiment_dir> --output_dir=<output_dir>
```

+We provide the scores and performance profiles for the baseline algorithms in the "Baseline Results" section in [Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179).

## Good Luck
44 changes: 23 additions & 21 deletions README.md
@@ -22,18 +22,19 @@

[MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it. For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179).

-# Table of Contents
+## Table of Contents

- [MLCommons™ Algorithmic Efficiency](#mlcommons-algorithmic-efficiency)
- [Table of Contents](#table-of-contents)
-- [Installation](#installation)
-- [Virtual environment](#virtual-environment)
+- [Installation](#installation)
+- [Python virtual environment](#python-virtual-environment)
- [Docker](#docker)
- [Building Docker Image](#building-docker-image)
- [Running Docker Container (Interactive)](#running-docker-container-interactive)
- [Running Docker Container (End-to-end)](#running-docker-container-end-to-end)
- [Getting Started](#getting-started)
- [Running a workload](#running-a-workload)
- [JAX](#jax)
- [Pytorch](#pytorch)
- [Rules](#rules)
- [Contributing](#contributing)
- [Note on shared data pipelines between JAX and PyTorch](#note-on-shared-data-pipelines-between-jax-and-pytorch)
@@ -58,7 +59,7 @@ You can install this package and dependencies in a [python virtual environment](#
pip3 install -e '.[full]'
```

-## Virtual environment
+### Python virtual environment

Note: Python minimum requirement >= 3.8
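
As a minimal sketch (assuming a system where the `venv` module is available), creating and activating a virtual environment before installing might look like:

```bash
python3 -m venv env         # requires Python >= 3.8
source env/bin/activate     # activate the environment
pip3 install -e '.[full]'   # install the package with all dependencies
```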

@@ -99,14 +100,14 @@ pip3 install -e '.[full]'

</details>

-## Docker
+### Docker

We recommend using a Docker container to ensure an environment similar to our scoring and testing environments.

**Prerequisites for NVIDIA GPU setup**: You may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs.
See instructions [here](https://github.com/NVIDIA/nvidia-docker).
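
One quick way to sanity-check the toolkit installation (the CUDA image tag below is just an example) is to run `nvidia-smi` inside a container:

```bash
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```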

-### Building Docker Image
+#### Building Docker Image

1. Clone this repository

@@ -121,12 +122,14 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).
docker build -t <docker_image_name> . --build-arg framework=<framework>
```

-The `framework` flag can be either `pytorch`, `jax` or `both`.
+The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework-specific dependencies.
The `docker_image_name` is arbitrary.
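
For example, to build a JAX-only image (the image name below is arbitrary, as noted above):

```bash
docker build -t algoperf_jax . --build-arg framework=jax
```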

-### Running Docker Container (Interactive)
+#### Running Docker Container (Interactive)

-1. Run detached Docker Container
+To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.
+
+1. Run a detached Docker container. The `container_id` will be printed if the container runs successfully.

```bash
docker run -t -d \
@@ -140,28 +143,27 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).
--keep_container_alive true
```

-This will print out a container id.
2. Open a bash terminal

```bash
docker exec -it <container_id> /bin/bash
```

-### Running Docker Container (End-to-end)
+#### Running Docker Container (End-to-end)

-To run a submission end-to-end in a container see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).
+To run a submission end-to-end in a containerized environment, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

-# Getting Started
+## Getting Started

For instructions on developing and scoring your own algorithm in the benchmark, see [Getting Started Document](./getting_started.md).

-## Running a workload
+### Running a workload

To run a submission directly by running a Docker container, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

From your virtual environment or interactively running Docker container, run:

-**JAX**
+#### JAX

```bash
python3 submission_runner.py \
@@ -173,7 +175,7 @@ python3 submission_runner.py \
--tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

-**Pytorch**
+#### Pytorch

```bash
python3 submission_runner.py \
@@ -205,7 +207,7 @@ torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc

So the complete command is, for example:

-```
+```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
submission_runner.py \
--framework=pytorch \
@@ -218,15 +220,15 @@ submission_runner.py \

</details>

-# Rules
+## Rules

The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate [rules document](RULES.md). Suggestions, clarifications, and questions can be raised via pull requests.

-# Contributing
+## Contributing

If you are interested in contributing to the work of the working group, feel free to [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/) or open issues. See our [CONTRIBUTING.md](CONTRIBUTING.md) for MLCommons contributing guidelines and setup and workflow instructions.

-# Note on shared data pipelines between JAX and PyTorch
+## Note on shared data pipelines between JAX and PyTorch

The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch versions of these workloads incur additional overhead.
