Commit: Move documentation around

pierre.delaunay committed Jan 13, 2025
1 parent 9ee81d7 commit f09f396

Showing 16 changed files with 257 additions and 121 deletions.
File renamed without changes.
49 changes: 49 additions & 0 deletions docs/Contributing/design.rst
@@ -0,0 +1,49 @@
Design
======

Milabench aims to simulate research workloads for benchmarking purposes.

* Performance is measured as throughput (samples per second).
  For example, for a model like ResNet the throughput would be images per second.

* Single-GPU workloads are spawned once per GPU to ensure the entire machine is used,
  simulating something similar to a hyperparameter search.
  The performance of the benchmark is the sum of the throughput of each process.

* Multi GPU workloads

* Multi Nodes
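The scoring rule above can be sketched in a few lines; ``ProcessResult`` and its field names are hypothetical, for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ProcessResult:
    """Throughput observed by one benchmark process (hypothetical structure)."""
    samples: int     # samples processed during the measurement window
    seconds: float   # wall-clock duration of the window


def throughput(result: ProcessResult) -> float:
    """Throughput of a single process, in samples per second."""
    return result.samples / result.seconds


def machine_throughput(results: list[ProcessResult]) -> float:
    """For single-GPU workloads run once per GPU, the benchmark score
    is the sum of the throughput of each process."""
    return sum(throughput(r) for r in results)


# Two GPUs, each processing 3000 images in 10 s -> 600 img/s total
print(machine_throughput([ProcessResult(3000, 10.0), ProcessResult(3000, 10.0)]))
```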


Run
---

* Milabench manager process

  * Handles messages from benchmark processes
  * Saves messages into a file for future analysis

* Benchmark processes

  * Run using ``voir``
  * voir is configured to intercept and send events during the training process
  * This allows us to add models from git repositories without modification
  * voir sends data through a file descriptor created by the milabench main process
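The event channel can be sketched as below. This is a simplified illustration, not voir's actual protocol: the JSON-lines format and the way the descriptor is handed over are assumptions for the example.

```python
import json
import os


def send_event(fd: int, **event) -> None:
    """Write one JSON-encoded event line to the inherited file descriptor."""
    os.write(fd, (json.dumps(event) + "\n").encode())


# The main process creates a pipe and passes the write end to the benchmark.
read_fd, write_fd = os.pipe()

# Inside the benchmark process (simulated here), emit a training observation.
send_event(write_fd, task="train", rate=512.0, units="items/s")
os.close(write_fd)

# The main process reads events back and saves them for later analysis.
with os.fdopen(read_fd) as stream:
    for line in stream:
        print(json.loads(line))
```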


What milabench is
-----------------

* Training focused

  * milabench shows candid performance numbers
  * No optimization beyond batch size scaling is performed
  * We want to measure the performance our researchers will see,
    not the performance they could get.

* PyTorch centric

  * PyTorch has become the de facto library for research
  * We are looking for accelerators with good maturity that can support
    this framework with limited code changes.


What milabench is not
---------------------

* milabench's goal is not to showcase the peak performance of an accelerator.
File renamed without changes.
File renamed without changes.
@@ -1,6 +1,6 @@

Creating a new benchmark
------------------------
Adding a benchmark
==================

To define a new benchmark (let's assume it is called ``ornatebench``),

91 changes: 84 additions & 7 deletions docs/flow.rst → docs/Contributing/overview.rst
@@ -1,5 +1,5 @@
Milabench Overview
------------------
Overview
========

.. code-block:: txt
@@ -230,11 +230,88 @@ Execution Flow
* **run_script**: the script will start to run now
* **finalize**: tearing down

How do I
--------

* I want to run a benchmark without milabench for debugging purposes

  * ``milabench dev {benchname}`` will open a bash shell with the benchmark venv sourced
  * Alternatively: ``source $MILABENCH_BASE/venv/torch/bin/activate``
Execution Plan
--------------

* milabench main process

  * gathers metrics from benchmark processes and saves them to a file
  * manages the benchmarks (timeouts, etc.)

* if ``per_gpu`` is used, milabench launches one process per GPU (sets ``CUDA_VISIBLE_DEVICES``)

  * each process logs its GPU data
  * might spawn a monitor process

    * will init pynvml

  * the dataloader will also spawn worker processes

    * usually not using the GPU

* if ``njobs`` is used, milabench launches a single process (torchrun)

  * torchrun in turn spawns one process per GPU

    * RANK 0 is used for logging

  * RANK 0 might spawn a monitor process

    * will init pynvml

  * the dataloader will also spawn worker processes

    * usually not using the GPU
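The ``per_gpu`` plan above can be sketched as a launcher loop. This is a minimal illustration of the idea, not milabench's actual implementation; the function name and the stand-in workload are made up for the example.

```python
import os
import subprocess
import sys


def launch_per_gpu(cmd: list[str], gpu_ids: list[int]) -> list[subprocess.Popen]:
    """Spawn one copy of the benchmark per GPU, each pinned to its device
    via CUDA_VISIBLE_DEVICES."""
    procs = []
    for gpu in gpu_ids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(cmd, env=env))
    return procs


# Stand-in workload: print which GPU each process was pinned to.
cmd = [sys.executable, "-c",
       "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
procs = launch_per_gpu(cmd, gpu_ids=[0, 1])
for p in procs:
    p.wait()  # milabench would instead gather metrics and enforce timeouts
```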

per_gpu
^^^^^^^

``per_gpu`` is used for mono-GPU benchmarks: it spawns one process per GPU, each running the same benchmark.

.. code-block:: yaml

   _torchvision:
     inherits: _defaults
     definition: ../benchmarks/torchvision
     group: torchvision
     install_group: torch
     plan:
       method: per_gpu

Milabench will essentially execute something akin to the following.

.. code-block:: bash

   echo "---"
   echo "fp16"
   echo "===="
   time (
     CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
     wait
   )
njobs
^^^^^

``njobs`` is used to launch a single job that can see all the GPUs.

.. code-block:: yaml

   _torchvision_ddp:
     inherits: _defaults
     definition: ../benchmarks/torchvision_ddp
     group: torchvision
     install_group: torch
     plan:
       method: njobs
       n: 1

Milabench will essentially execute something akin to the following.

.. code-block:: bash

   echo "---"
   echo "lightning-gpus"
   echo "=============="
   time (
     $BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 &
     wait
   )
1 change: 1 addition & 0 deletions docs/process.rst → docs/Contributing/process.rst
@@ -8,6 +8,7 @@ Preparing

* NVIDIA
* AMD
* Intel

2. Create a milabench configuration for your RFP
Milabench comes with a wide variety of benchmarks.
33 changes: 22 additions & 11 deletions docs/recipes.rst → docs/Contributing/recipes.rst
@@ -1,5 +1,5 @@
Running Milabench
=================
Recipes
=======

Base Setup
----------
@@ -35,11 +35,9 @@ The current setup runs on 8xA100 SXM4 80Go.
Note that some benchmarks do require more than 40Go of VRAM.
One bench might be problematic: rwkv requires nvcc, but it can be ignored.

Recipes
-------

Increase Runtime
^^^^^^^^^^^^^^^^
----------------

For profiling it might be useful to run the benchmark for longer than the default configuration.
You can update the yaml file (``config/base.yaml`` or ``config/standard.yaml``) to increase the runtime limits.
@@ -57,7 +55,7 @@ and ``voir.options.stop`` which represent the target number of observations mila
# an observation is usually a batch forward/backward/optimizer.step (i.e one train step)
One Env
^^^^^^^
-------

If you are using a container with dependencies such as PyTorch already installed,
you can force milabench to use a single environment for everything.
@@ -69,17 +67,17 @@
milabench run --use-current-env --select bert-fp32
Batch resizer
^^^^^^^^^^^^^
-------------

If the GPU you are using has less VRAM, automatic batch resizing can be enabled with the command below.
Note that this will not impact benchmarks that already use a batch size of one, such as opt-6_7b and possibly opt-1_3b.

.. code-block:: bash
MILABENCH_SIZER_AUTO=True milabench run
MILABENCH_SIZER_AUTO=1 milabench run
Device Select
^^^^^^^^^^^^^
-------------

To run on a subset of GPUs (by default milabench tries to use all the GPUs all the time,
which can make a run take a bit longer; reducing the number of visible devices to 2 can make experimentation faster):
@@ -89,7 +87,7 @@
CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run
Update Package
^^^^^^^^^^^^^^
--------------

To update PyTorch to use a newer version of CUDA (milabench creates a separate environment for the benchmarks):

@@ -100,7 +98,7 @@
pip install -U torch torchvision torchaudio
Arguments
^^^^^^^^^
---------

If environment variables are troublesome, the values can also be passed as arguments.

@@ -118,6 +116,18 @@ It holds all the benchmark specific logs and metrics gathered by milabench.
zip -r results.zip results
Run a benchmark without milabench
---------------------------------

.. code-block:: bash

   milabench dev {benchname}  # opens a bash shell with the benchmark venv sourced
   # alternatively
   source $MILABENCH_BASE/venv/torch/bin/activate
Containers
----------

@@ -306,6 +316,7 @@ Example Reports
Issues
------

.. code-block:: txt

   > Traceback (most recent call last):
File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 4 additions & 0 deletions docs/Welcome/Changelog.rst
@@ -0,0 +1,4 @@
Changelog
=========

TBD
54 changes: 54 additions & 0 deletions docs/Welcome/Features.rst
@@ -0,0 +1,54 @@
Features
========

* Non-intrusive instrumentation
* Validation layers
* Automatic batch resizing
* Docker
* Hardware

  * ROCm 5.7
  * NVIDIA

* Metrics gathering

  * Performance throughput
  * GPU utilization
  * CPU utilization
  * IO utilization


Benchmarks
----------

.. code-block:: text

   +--------------------------+-----------+-----------+-------------+-----------+-------------------+
   | Benchmark                | Unit      | Domain    | Network     | Focus     | Task              |
   +==========================+===========+===========+=============+===========+===================+
   | bf16                     | TFlops    | Synthetic |             | Training  |                   |
   | fp16                     | TFlops    | Synthetic |             | Training  |                   |
   | tf32                     | TFlops    | Synthetic |             | Training  |                   |
   | fp32                     | TFlops    | Synthetic |             | Training  |                   |
   | bert-fp16                |           | NLP       | Transformer | Training  | Language Modeling |
   | bert-fp32                |           | NLP       | Transformer | Training  | Language Modeling |
   | bert-tf32                |           | NLP       | Transformer | Training  | Language Modeling |
   | bert-tf32-fp16           |           | NLP       | Transformer | Training  | Language Modeling |
   | opt-1_3b                 |           | NLP       | Transformer | Training  | Language Modeling |
   | opt-6_7b                 |           | NLP       | Transformer | Training  | Language Modeling |
   | reformer                 |           | NLP       | Transformer | Training  | Language Modeling |
   | rwkv                     |           | NLP       | RNN         | Training  | Language Modeling |
   | llama                    | Token/sec | NLP       | Transformer | Inference | Generation        |
   | dlrm                     |           | NLP       |             | Training  | Recommendation    |
   | convnext_large-fp16      | img/sec   | Vision    | Convolution | Training  | Classification    |
   | convnext_large-fp32      | img/sec   | Vision    | Convolution | Training  | Classification    |
   | convnext_large-tf32      | img/sec   | Vision    | Convolution | Training  | Classification    |
   | convnext_large-tf32-fp16 | img/sec   | Vision    | Convolution | Training  | Classification    |
   | davit_large              | img/sec   | Vision    | Transformer | Training  | Classification    |
   | focalnet                 |           | Vision    | Convolution | Training  | Classification    |
   | davit_large-multi        | img/sec   | Vision    | Transformer | Training  | Classification    |
   | regnet_y_128gf           | img/sec   | Vision    | Convolution | Training  | Classification    |
   | resnet152                | img/sec   | Vision    | Convolution | Training  | Classification    |
   | resnet152-multi          | img/sec   | Vision    | Convolution | Training  | Classification    |
   | resnet50                 | img/sec   | Vision    | Convolution | Training  | Classification    |
   | stargan                  | img/sec   | Vision    | Convolution | Training  | GAN               |
   | super-slomo              | img/sec   | Vision    | Convolution | Training  |                   |
   | t5                       |           | NLP       | Transformer | Training  |                   |
   | whisper                  |           | Audio     |             | Training  |                   |
   +--------------------------+-----------+-----------+-------------+-----------+-------------------+
10 changes: 10 additions & 0 deletions docs/Welcome/Roadmap.rst
@@ -0,0 +1,10 @@
Roadmap
=======

* Cloud CI
* ROCm 6.0 - MI300 support
* GPU Max Series - 1550 support
* Evaluate suitability

  * Tenstorrent
  * Graphcore
  * Cerebras