Commit: Move documentation around

pierre.delaunay committed Jan 13, 2025
1 parent 9ee81d7 commit f09f396

Showing 16 changed files with 257 additions and 121 deletions.
File renamed without changes.
49 changes: 49 additions & 0 deletions docs/Contributing/design.rst
@@ -0,0 +1,49 @@
Design
======

Milabench aims to simulate research workloads for benchmarking purposes.

* Performance is measured as throughput (samples per second).
  For example, for a model like ResNet the throughput would be images per second.

* Single-GPU workloads are spawned once per GPU to ensure the entire machine is used,
  simulating something similar to a hyperparameter search.
  The performance of the benchmark is the sum of the throughput of each process.

* Multi GPU workloads

* Multi Nodes
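The scoring rule above can be sketched in a few lines; ``ProcessResult`` and its field names are hypothetical, for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ProcessResult:
    """Throughput observed by one benchmark process (hypothetical structure)."""
    samples: int     # samples processed during the measurement window
    seconds: float   # wall-clock duration of the window


def throughput(result: ProcessResult) -> float:
    """Throughput of a single process, in samples per second."""
    return result.samples / result.seconds


def machine_throughput(results: list[ProcessResult]) -> float:
    """For single-GPU workloads run once per GPU, the benchmark score
    is the sum of the throughput of each process."""
    return sum(throughput(r) for r in results)


# Two GPUs, each processing 3000 images in 10 s -> 600 img/s total
print(machine_throughput([ProcessResult(3000, 10.0), ProcessResult(3000, 10.0)]))
```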


Run
---

* Milabench manager process

  * Handles messages from benchmark processes
  * Saves messages into a file for future analysis

* Benchmark processes

  * Run using ``voir``
  * voir is configured to intercept and send events during the training process
  * This allows us to add models from git repositories without modification
  * voir sends data through a file descriptor created by the milabench main process
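The event channel can be sketched as below. This is a simplified illustration, not voir's actual protocol: the JSON-lines format and the way the descriptor is handed over are assumptions for the example.

```python
import json
import os


def send_event(fd: int, **event) -> None:
    """Write one JSON-encoded event line to the inherited file descriptor."""
    os.write(fd, (json.dumps(event) + "\n").encode())


# The main process creates a pipe and passes the write end to the benchmark.
read_fd, write_fd = os.pipe()

# Inside the benchmark process (simulated here), emit a training observation.
send_event(write_fd, task="train", rate=512.0, units="items/s")
os.close(write_fd)

# The main process reads events back and saves them for later analysis.
with os.fdopen(read_fd) as stream:
    for line in stream:
        print(json.loads(line))
```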


What milabench is
-----------------

* Training focused

  * milabench shows candid performance numbers
  * No optimization beyond batch size scaling is performed
  * We want to measure the performance our researchers will see,
    not the performance they could get.

* PyTorch centric

  * PyTorch has become the de facto library for research
  * We are looking for accelerators with good maturity that can support
    this framework with limited code changes.


What milabench is not
---------------------

* milabench's goal is not to showcase the peak performance of an accelerator.
File renamed without changes.
File renamed without changes.
@@ -1,6 +1,6 @@

Creating a new benchmark
------------------------
Adding a benchmark
==================

To define a new benchmark (let's assume it is called ``ornatebench``),

91 changes: 84 additions & 7 deletions docs/flow.rst → docs/Contributing/overview.rst
@@ -1,5 +1,5 @@
Milabench Overview
------------------
Overview
========

.. code-block:: txt
@@ -230,11 +230,88 @@ Execution Flow
* **run_script**: the script will start to run now
* **finalize**: tearing down

How do I
--------

* I want to run a benchmark without milabench for debugging purposes

  * ``milabench dev {benchname}`` will open a bash shell with the benchmark venv sourced
  * Alternatively: ``source $MILABENCH_BASE/venv/torch/bin/activate``
Execution Plan
--------------

* milabench main process

  * gathers metrics from benchmark processes and saves them to a file
  * manages the benchmarks (timeouts, etc.)

* if ``per_gpu`` is used, milabench launches one process per GPU (sets ``CUDA_VISIBLE_DEVICES``)

  * each process logs its GPU data
  * might spawn a monitor process

    * will init pynvml

  * the dataloader will also spawn worker processes

    * usually not using the GPU

* if ``njobs`` is used, milabench launches a single process (torchrun)

  * torchrun in turn spawns one process per GPU

    * RANK 0 is used for logging

  * RANK 0 might spawn a monitor process

    * will init pynvml

  * the dataloader will also spawn worker processes

    * usually not using the GPU
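The ``per_gpu`` plan above can be sketched as a launcher loop. This is a minimal illustration of the idea, not milabench's actual implementation; the function name and the stand-in workload are made up for the example.

```python
import os
import subprocess
import sys


def launch_per_gpu(cmd: list[str], gpu_ids: list[int]) -> list[subprocess.Popen]:
    """Spawn one copy of the benchmark per GPU, each pinned to its device
    via CUDA_VISIBLE_DEVICES."""
    procs = []
    for gpu in gpu_ids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(cmd, env=env))
    return procs


# Stand-in workload: print which GPU each process was pinned to.
cmd = [sys.executable, "-c",
       "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
procs = launch_per_gpu(cmd, gpu_ids=[0, 1])
for p in procs:
    p.wait()  # milabench would instead gather metrics and enforce timeouts
```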

per_gpu
^^^^^^^

``per_gpu`` is used for mono-GPU benchmarks: it spawns one process per GPU, each running the same benchmark.

.. code-block:: yaml

   _torchvision:
     inherits: _defaults
     definition: ../benchmarks/torchvision
     group: torchvision
     install_group: torch
     plan:
       method: per_gpu

Milabench will essentially execute something akin to the following.

.. code-block:: bash

   echo "---"
   echo "fp16"
   echo "===="
   time (
     CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     wait
   )
njobs
^^^^^

``njobs`` is used to launch a single job that can see all the GPUs.

.. code-block:: yaml

   _torchvision_ddp:
     inherits: _defaults
     definition: ../benchmarks/torchvision_ddp
     group: torchvision
     install_group: torch
     plan:
       method: njobs
       n: 1

Milabench will essentially execute something akin to the following.

.. code-block:: bash

   echo "---"
   echo "lightning-gpus"
   echo "=============="
   time (
     $BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 &
     wait
   )
1 change: 1 addition & 0 deletions docs/process.rst → docs/Contributing/process.rst
@@ -8,6 +8,7 @@ Preparing

* NVIDIA
* AMD
* Intel

2. Create a milabench configuration for your RFP
Milabench comes with a wide variety of benchmarks.
33 changes: 22 additions & 11 deletions docs/recipes.rst → docs/Contributing/recipes.rst
@@ -1,5 +1,5 @@
Running Milabench
=================
Recipes
=======

Base Setup
----------
@@ -35,11 +35,9 @@ The current setup runs on 8xA100 SXM4 80Go.
Note that some benchmarks do require more than 40Go of VRAM.
One bench might be problematic: rwkv requires nvcc, but it can be ignored.

Recipes
-------

Increase Runtime
^^^^^^^^^^^^^^^^
----------------

For profiling it might be useful to run the benchmark for longer than the default configuration.
You can update the yaml file (``config/base.yaml`` or ``config/standard.yaml``) to increase the runtime limits.
@@ -57,7 +55,7 @@ and ``voir.options.stop`` which represent the target number of observations mila
# an observation is usually a batch forward/backward/optimizer.step (i.e one train step)
One Env
^^^^^^^
-------

If you are using a container with dependencies such as PyTorch already installed,
you can force milabench to use a single environment for everything.
@@ -69,17 +67,17 @@
milabench run --use-current-env --select bert-fp32
Batch resizer
^^^^^^^^^^^^^
-------------

If the GPU you are using has less VRAM, automatic batch resizing can be enabled with the command below.
Note that this will not impact benchmarks that already use a batch size of one, such as opt-6_7b and possibly opt-1_3b.

.. code-block:: bash
MILABENCH_SIZER_AUTO=True milabench run
MILABENCH_SIZER_AUTO=1 milabench run
Device Select
^^^^^^^^^^^^^
-------------

To run on a subset of GPUs (by default milabench tries to use all the GPUs all the time,
which can make a run take a bit longer; reducing the number of visible devices to 2 can make experimentation faster):
@@ -89,7 +87,7 @@
CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run
Update Package
^^^^^^^^^^^^^^
--------------

To update PyTorch to use a newer version of CUDA (milabench creates a separate environment for the benchmarks):

@@ -100,7 +98,7 @@
pip install -U torch torchvision torchaudio
Arguments
^^^^^^^^^
---------

If environment variables are troublesome, the values can also be passed as arguments.

@@ -118,6 +116,18 @@ It holds all the benchmark specific logs and metrics gathered by milabench.
zip -r results.zip results
Run a benchmark without milabench
---------------------------------

.. code-block:: bash

   milabench dev {benchname}  # opens a bash shell with the benchmark venv sourced
   # alternatively
   source $MILABENCH_BASE/venv/torch/bin/activate
Containers
----------

@@ -306,6 +316,7 @@ Example Reports
Issues
------

.. code-block:: txt

   > Traceback (most recent call last):
File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 4 additions & 0 deletions docs/Welcome/Changelog.rst
@@ -0,0 +1,4 @@
Changelog
=========

TBD
54 changes: 54 additions & 0 deletions docs/Welcome/Features.rst
@@ -0,0 +1,54 @@
Features
========

* Non-intrusive instrumentation
* Validation layers
* Automatic batch resizing
* Docker
* Hardware

  * ROCm 5.7
  * NVIDIA

* Metrics gathering

  * Performance throughput
  * GPU utilization
  * CPU utilization
  * IO utilization


Benchmarks
----------

.. code-block:: text

   +--------------------------+-----------+-----------+-------------+-----------+-------------------+
   | Benchmark                | Unit      | Domain    | Network     | Focus     | Task              |
   +==========================+===========+===========+=============+===========+===================+
   | bf16                     | TFlops    | Synthetic |             | Training  |                   |
   | fp16                     | TFlops    | Synthetic |             | Training  |                   |
   | tf32                     | TFlops    | Synthetic |             | Training  |                   |
   | fp32                     | TFlops    | Synthetic |             | Training  |                   |
   | bert-fp16                |           | NLP       | Transformer | Training  | Language Modeling |
   | bert-fp32                |           | NLP       | Transformer | Training  | Language Modeling |
   | bert-tf32                |           | NLP       | Transformer | Training  | Language Modeling |
   | bert-tf32-fp16           |           | NLP       | Transformer | Training  | Language Modeling |
   | opt-1_3b                 |           | NLP       | Transformer | Training  | Language Modeling |
   | opt-6_7b                 |           | NLP       | Transformer | Training  | Language Modeling |
   | reformer                 |           | NLP       | Transformer | Training  | Language Modeling |
   | rwkv                     |           | NLP       | RNN         | Training  | Language Modeling |
   | llama                    | Token/sec | NLP       | Transformer | Inference | Generation        |
   | dlrm                     |           | NLP       |             | Training  | Recommendation    |
   | convnext_large-fp16      | img/sec   | Vision    | Convolution | Training  | Classification    |
   | convnext_large-fp32      | img/sec   | Vision    | Convolution | Training  | Classification    |
   | convnext_large-tf32      | img/sec   | Vision    | Convolution | Training  | Classification    |
   | convnext_large-tf32-fp16 | img/sec   | Vision    | Convolution | Training  | Classification    |
   | davit_large              | img/sec   | Vision    | Transformer | Training  | Classification    |
   | focalnet                 |           | Vision    | Convolution | Training  | Classification    |
   | davit_large-multi        | img/sec   | Vision    | Transformer | Training  | Classification    |
   | regnet_y_128gf           | img/sec   | Vision    | Convolution | Training  | Classification    |
   | resnet152                | img/sec   | Vision    | Convolution | Training  | Classification    |
   | resnet152-multi          | img/sec   | Vision    | Convolution | Training  | Classification    |
   | resnet50                 | img/sec   | Vision    | Convolution | Training  | Classification    |
   | stargan                  | img/sec   | Vision    | Convolution | Training  | GAN               |
   | super-slomo              | img/sec   | Vision    | Convolution | Training  |                   |
   | t5                       |           | NLP       | Transformer | Training  |                   |
   | whisper                  |           | Audio     |             | Training  |                   |
   +--------------------------+-----------+-----------+-------------+-----------+-------------------+
10 changes: 10 additions & 0 deletions docs/Welcome/Roadmap.rst
@@ -0,0 +1,10 @@
Roadmap
=======

* Cloud CI
* ROCm 6.0 - MI300 support
* GPU Max Series - 1550 support
* Evaluate suitability

  * Tenstorrent
  * Graphcore
  * Cerebras