Add Submission Process Rules #476

Merged
merged 19 commits into from Nov 3, 2023
Changes from 1 commit
Merge branch 'dev' into CfS
fsschneider committed Aug 11, 2023
commit 06238ae3bacce7383acfe0d354b7a3648bf303e1
60 changes: 30 additions & 30 deletions GETTING_STARTED.md
@@ -1,30 +1,28 @@
# Getting Started

Table of Contents:

- [Getting Started](#getting-started)
-- [Workspace set up and installation](#workspace-set-up-and-installation)
-- [Download the data](#download-the-data)
-- [Develop your submission](#develop-your-submission)
-- [Set up your directory structure (Optional)](#set-up-your-directory-structure-optional)
-- [Coding your submission](#coding-your-submission)
-- [Run your submission](#run-your-submission)
-- [Pytorch DDP](#pytorch-ddp)
-- [Run your submission in a Docker container](#run-your-submission-in-a-docker-container)
-- [Docker Tips](#docker-tips)
-- [Score your submission](#score-your-submission)
-- [Good Luck](#good-luck)
-
-## Workspace set up and installation
+- [Set up and installation](#set-up-and-installation)
+- [Download the data](#download-the-data)
+- [Develop your submission](#develop-your-submission)
+- [Set up your directory structure (Optional)](#set-up-your-directory-structure-optional)
+- [Coding your submission](#coding-your-submission)
+- [Run your submission](#run-your-submission)
+- [Pytorch DDP](#pytorch-ddp)
+- [Run your submission in a Docker container](#run-your-submission-in-a-docker-container)
+- [Docker Tips](#docker-tips)
+- [Score your submission](#score-your-submission)
+- [Good Luck](#good-luck)
+
+## Set up and installation

To get started, you will have to make a few decisions and install the repository along with its dependencies. Specifically:

1. Decide whether you would like to develop your submission in PyTorch or JAX.
2. Set up your workstation or VM. We recommend using a setup similar to the [benchmarking hardware](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#benchmarking-hardware).
   The specs on the benchmarking machines are:
   - 8 V100 GPUs
   - 240 GB of RAM
   - 2 TB of storage (for datasets).

3. Install the algorithmic package and its dependencies; see [Installation](./README.md#installation).

## Download the data
@@ -102,7 +100,7 @@ python3 submission_runner.py \
--tuning_search_space=<path_to_tuning_search_space>
```

-#### Pytorch DDP
+### Pytorch DDP

We recommend using PyTorch's [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
when using multiple GPUs on a single node. You can initialize DDP with `torchrun`.
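
A minimal sketch of such a launch on a single 8-GPU node (flags abridged, placeholders in angle brackets; it mirrors the full example shown later in the README):

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  submission_runner.py \
  --framework=pytorch \
  --workload=<workload> \
  --submission_path=<path_to_submission_module> \
  --tuning_search_space=<path_to_tuning_search_space>
```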
@@ -132,14 +130,14 @@ torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \

The container entrypoint script provides the following flags:

-- `-d` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/<dataset>` does not exist on the host machine. Required for running a submission.
-- `-f` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
-- `-s` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
-- `-t` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
-- `-e` experiment_name: name of experiment. Required for running a submission.
-- `-w` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
-- `-m` max_steps: maximum number of steps to run the workload for. Optional.
-- `-b` debugging_mode: can be true or false. If `-b` (debugging_mode) is `true` the main process on the container will persist.
+- `--dataset` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/<dataset>` does not exist on the host machine. Required for running a submission.
+- `--framework` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `--dataset imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
+- `--submission_path` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
+- `--tuning_search_space` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
+- `--experiment_name` experiment_name: name of experiment. Required for running a submission.
+- `--workload` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
+- `--max_global_steps` max_global_steps: maximum number of steps to run the workload for. Optional.
+- `--keep_container_alive`: can be true or false. If `true` the container will not be killed automatically. This is useful for developing or debugging.
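
As an illustration only (the image name, the data mount, and all values below are placeholders, not prescriptive), a submission run using the long-form entrypoint flags might look like:

```bash
docker run -t -d \
  --gpus all \
  -v $HOME/data/:/data/ \
  <docker_image_name> \
  --dataset ogbg \
  --framework jax \
  --submission_path <path_to_submission_module> \
  --tuning_search_space <path_to_tuning_search_space> \
  --experiment_name my_experiment \
  --workload ogbg \
  --keep_container_alive false
```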

To start the Docker container that will run the submission runner:

@@ -162,23 +160,23 @@ docker run -t -d \

This will print the container ID to the terminal.

-#### Docker Tips ####
+#### Docker Tips

To find the container IDs of running containers

-```
+```bash
docker ps
```

To see output of the entrypoint script

-```
+```bash
docker logs <container_id>
```

To enter a bash session in the container

-```
+```bash
docker exec -it <container_id> /bin/bash
```
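
And to stop the container when you are done (standard Docker, not specific to this repository's tooling):

```bash
docker stop <container_id>
```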

@@ -190,4 +188,6 @@ To produce performance profile and performance table:
python3 scoring/score_submission.py --experiment_path=<path_to_experiment_dir> --output_dir=<output_dir>
```

+We provide the scores and performance profiles for the baseline algorithms in the "Baseline Results" section in [Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179).

## Good Luck
44 changes: 23 additions & 21 deletions README.md
@@ -22,18 +22,19 @@

[MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it. For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179).

-# Table of Contents
+## Table of Contents

- [MLCommons™ Algorithmic Efficiency](#mlcommons-algorithmic-efficiency)
- [Table of Contents](#table-of-contents)
-- [Installation](#installation)
-- [Virtual environment](#virtual-environment)
+- [Installation](#installation)
+- [Python virtual environment](#python-virtual-environment)
- [Docker](#docker)
- [Building Docker Image](#building-docker-image)
- [Running Docker Container (Interactive)](#running-docker-container-interactive)
- [Running Docker Container (End-to-end)](#running-docker-container-end-to-end)
- [Getting Started](#getting-started)
- [Running a workload](#running-a-workload)
- [JAX](#jax)
- [Pytorch](#pytorch)
- [Rules](#rules)
- [Contributing](#contributing)
- [Note on shared data pipelines between JAX and PyTorch](#note-on-shared-data-pipelines-between-jax-and-pytorch)
@@ -58,7 +59,7 @@ You can install this package and dependencies in a [python virtual environment](#
pip3 install -e '.[full]'
```

-## Virtual environment
+### Python virtual environment

Note: Python minimum requirement >= 3.8
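
As a minimal sketch (assuming a system where the `venv` module is available), creating and activating a virtual environment before installing might look like:

```bash
python3 -m venv env         # requires Python >= 3.8
source env/bin/activate     # activate the environment
pip3 install -e '.[full]'   # install the package with all dependencies
```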

@@ -99,14 +100,14 @@ pip3 install -e '.[full]'

</details>

-## Docker
+### Docker

We recommend using a Docker container to ensure an environment similar to our scoring and testing environments.

**Prerequisites for NVIDIA GPU setup**: You may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs.
See instructions [here](https://github.com/NVIDIA/nvidia-docker).
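
One quick way to sanity-check the toolkit installation (the CUDA image tag below is just an example) is to run `nvidia-smi` inside a container:

```bash
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```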

-### Building Docker Image
+#### Building Docker Image

1. Clone this repository

@@ -121,12 +122,14 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).
docker build -t <docker_image_name> . --build-arg framework=<framework>
```

-The `framework` flag can be either `pytorch`, `jax` or `both`.
+The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework-specific dependencies.
The `docker_image_name` is arbitrary.
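
For example, to build a JAX-only image (the image name below is arbitrary, as noted above):

```bash
docker build -t algoperf_jax . --build-arg framework=jax
```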

-### Running Docker Container (Interactive)
+#### Running Docker Container (Interactive)

-1. Run detached Docker Container
+To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.
+
+1. Run a detached Docker container. The `container_id` will be printed if the container runs successfully.

```bash
docker run -t -d \
@@ -140,28 +143,27 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).
--keep_container_alive true
```

-This will print out a container id.
2. Open a bash terminal

```bash
docker exec -it <container_id> /bin/bash
```

-### Running Docker Container (End-to-end)
+#### Running Docker Container (End-to-end)

-To run a submission end-to-end in a container see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).
+To run a submission end-to-end in a containerized environment, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

-# Getting Started
+## Getting Started

For instructions on developing and scoring your own algorithm in the benchmark, see [Getting Started Document](./getting_started.md).

-## Running a workload
+### Running a workload

To run a submission directly by running a Docker container, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

From your virtual environment or interactively running Docker container, run:

-**JAX**
+#### JAX

```bash
python3 submission_runner.py \
@@ -173,7 +175,7 @@ python3 submission_runner.py \
--tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

-**Pytorch**
+#### Pytorch

```bash
python3 submission_runner.py \
@@ -205,7 +207,7 @@ torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc

So the complete command is, for example:

-```
+```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
submission_runner.py \
--framework=pytorch \
@@ -218,15 +220,15 @@ submission_runner.py \

</details>

-# Rules
+## Rules

The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate [rules document](RULES.md). Suggestions, clarifications, and questions can be raised via pull requests.

-# Contributing
+## Contributing

If you are interested in contributing to the work of the working group, feel free to [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/) or open issues. See our [CONTRIBUTING.md](CONTRIBUTING.md) for MLCommons contributing guidelines and setup and workflow instructions.

-# Note on shared data pipelines between JAX and PyTorch
+## Note on shared data pipelines between JAX and PyTorch

The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch versions of these workloads incur additional overhead.
