diff --git a/docs/config.rst b/docs/Contributing/config.rst similarity index 100% rename from docs/config.rst rename to docs/Contributing/config.rst diff --git a/docs/Contributing/design.rst b/docs/Contributing/design.rst new file mode 100644 index 000000000..c7b1925ad --- /dev/null +++ b/docs/Contributing/design.rst @@ -0,0 +1,49 @@ +Design +====== + +Milabench aims to simulate research workloads for benchmarking purposes. + +* Performance is measured as throughput (samples per second). + For example, for a model like resnet the throughput would be images per second. + +* Single GPU workloads are spawned per GPU to ensure the entire machine is used. + This simulates something similar to a hyperparameter search. + The performance of the benchmark is the sum of the throughput of each process. + +* Multi GPU workloads + +* Multi Nodes + + +Run +--- + +* Milabench Manager Process + * Handles messages from benchmark processes + * Saves messages into a file for future analysis + +* Benchmark processes + * run using ``voir`` + * voir is configured to intercept and send events during the training process + * This allows us to add models from git repositories without modification + * voir sends data through a file descriptor that was created by the milabench main process + + +What milabench is +----------------- + +* Training focused +* milabench shows candid performance numbers + * No optimization beyond batch size scaling is performed + * We want to measure the performance our researchers will see, + not the performance they could get. +* PyTorch centric + * PyTorch has become the de facto library for research + * We are looking for accelerators with good maturity that can support + this framework with limited code changes. + + +What milabench is not +--------------------- + +* milabench's goal is not to be a performance showcase for an accelerator. 
diff --git a/docs/dev-usage.rst b/docs/Contributing/dev-usage.rst similarity index 100% rename from docs/dev-usage.rst rename to docs/Contributing/dev-usage.rst diff --git a/docs/instrument.rst b/docs/Contributing/instrument.rst similarity index 100% rename from docs/instrument.rst rename to docs/Contributing/instrument.rst diff --git a/docs/new_benchmarks.rst b/docs/Contributing/new_benchmarks.rst similarity index 99% rename from docs/new_benchmarks.rst rename to docs/Contributing/new_benchmarks.rst index e348a28be..d8d72ff44 100644 --- a/docs/new_benchmarks.rst +++ b/docs/Contributing/new_benchmarks.rst @@ -1,6 +1,6 @@ -Creating a new benchmark ------------------------- +Adding a benchmark +================== To define a new benchmark (let's assume it is called ``ornatebench``), diff --git a/docs/flow.rst b/docs/Contributing/overview.rst similarity index 72% rename from docs/flow.rst rename to docs/Contributing/overview.rst index 45f212c46..7aa441325 100644 --- a/docs/flow.rst +++ b/docs/Contributing/overview.rst @@ -1,5 +1,5 @@ -Milabench Overview ------------------- +Overview +======== .. code-block:: txt @@ -230,11 +230,88 @@ Execution Flow * **run_script**: the script will start to run now * **finalize**: tearing down -How do I --------- -* I want to run a benchmark without milabench for debugging purposes - * ``milabench dev {benchname}`` will open bash with the benchmark venv sourced - * alternatively: ``source $MILABENCH_BASE/venv/torch/bin/activate`` +Execution Plan +-------------- + +* milabench main process + * gather metrics from benchmark processes, save them to file + * manages the benchmarks (timeout etc...) 
+ + * if ``per_gpu`` is used, milabench will launch one process per GPU (sets ``CUDA_VISIBLE_DEVICES``) + * each process logs its GPU data + * might spawn a monitor process + * will init pynvml + * dataloader will also spawn process workers + * usually not using GPU + + * if ``njobs`` is used, milabench will launch a single process (torchrun) + * torchrun in turn will spawn one process per GPU + * RANK 0 is used for logging + * RANK 0 might spawn a monitor process + * will init pynvml + * dataloader will also spawn process workers + * usually not using GPU + +per_gpu +^^^^^^^ + +``per_gpu``: used for single-GPU benchmarks; spawns one process per GPU, each running the same benchmark + +.. code-block:: yaml + + _torchvision: + inherits: _defaults + definition: ../benchmarks/torchvision + group: torchvision + install_group: torch + plan: + method: per_gpu + +Milabench will essentially execute something akin to the following. + +.. code-block:: bash + + echo "---" + echo "fp16" + echo "====" + time ( + CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch 
$SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & + wait + ) + +njobs +^^^^^ + +``njobs``: used to launch a single job that can see all the GPUs. + +.. code-block:: yaml + _torchvision_ddp: + inherits: _defaults + definition: ../benchmarks/torchvision_ddp + group: torchvision + install_group: torch + plan: + method: njobs + n: 1 + +Milabench will essentially execute something akin to the following. + +.. code-block:: bash + + echo "---" + echo "lightning-gpus" + echo "==============" + time ( + $BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 & + wait + ) diff --git a/docs/process.rst b/docs/Contributing/process.rst similarity index 99% rename from docs/process.rst rename to docs/Contributing/process.rst index 73f83731b..cafc03115 100644 --- a/docs/process.rst +++ b/docs/Contributing/process.rst @@ -8,6 +8,7 @@ Preparing * NVIDIA * AMD + * Intel 2. Create a milabench configuration for your RFP Milabench comes with a wide variety of benchmarks. 
diff --git a/docs/recipes.rst b/docs/Contributing/recipes.rst similarity index 97% rename from docs/recipes.rst rename to docs/Contributing/recipes.rst index f647ab452..786a24bdf 100644 --- a/docs/recipes.rst +++ b/docs/Contributing/recipes.rst @@ -1,5 +1,5 @@ -Running Milabench -================= +Recipes +======= Base Setup ---------- @@ -35,11 +35,9 @@ The current setup runs on 8xA100 SXM4 80Go. Note that some benchmarks do require more than 40Go of VRAM. One bench might be problematic; rwkv which requires nvcc but can be ignored. -Recipes -------- Increase Runtime -^^^^^^^^^^^^^^^^ +---------------- For profiling it might be useful to run the benchmark for longer than the default configuration. You can update the yaml file (``config/base.yaml`` or ``config/standard.yaml``) to increase the runtime limits. @@ -57,7 +55,7 @@ and ``voir.options.stop`` which represent the target number of observations mila # an observation is usually a batch forward/backward/optimizer.step (i.e one train step) One Env -^^^^^^^ +------- If your are using a container with dependencies such as pytorch already installed, you can force milabench to use a single environment for everything. @@ -69,17 +67,17 @@ you can force milabench to use a single environment for everything. milabench run --use-current-env --select bert-fp32 Batch resizer -^^^^^^^^^^^^^ +------------- If the GPU you are using has lower VRAM automatic batch resizing could be enabled with the command below. Note that will not impact benchmarks that already use a batch of one, such as opt-6_7b and possibly opt-1_3b. .. 
code-block:: bash - MILABENCH_SIZER_AUTO=True milabench run + MILABENCH_SIZER_AUTO=1 milabench run Device Select -^^^^^^^^^^^^^ +------------- To run on a subset of GPUs (note that by default milabench will try to use all the GPUs all the time which might make a run take a bit longer, reducing the number of visible devices to 2 might make experimentation faster) @@ -89,7 +87,7 @@ which might make a run take a bit longer, reducing the number of visible devices CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run Update Package -^^^^^^^^^^^^^^ +-------------- To update pytorch to use a newer version of cuda (milabench creates a separate environment for benchmarks) @@ -100,7 +98,7 @@ To update pytorch to use a newer version of cuda (milabench creates a separate e pip install -U torch torchvision torchaudio Arguments -^^^^^^^^^ +--------- If environment variables are troublesome, the values can also be passed as arguments. @@ -118,6 +116,18 @@ It holds all the benchmark specific logs and metrics gathered by milabench. zip -r results.zip results +Run a benchmark without milabench +--------------------------------- + +.. code-block:: bash + + milabench dev {benchname} # will open bash with the benchmark venv sourced + + # alternatively + + source $MILABENCH_BASE/venv/torch/bin/activate + + Containers ---------- @@ -306,6 +316,7 @@ Example Reports Issues ------ + .. 
code-block:: txt > Traceback (most recent call last): diff --git a/docs/sizer.rst b/docs/Contributing/sizer.rst similarity index 100% rename from docs/sizer.rst rename to docs/Contributing/sizer.rst diff --git a/docs/docker.rst b/docs/GettingStarted/docker.rst similarity index 100% rename from docs/docker.rst rename to docs/GettingStarted/docker.rst diff --git a/docs/usage.rst b/docs/GettingStarted/usage.rst similarity index 100% rename from docs/usage.rst rename to docs/GettingStarted/usage.rst diff --git a/docs/Welcome/Changelog.rst b/docs/Welcome/Changelog.rst new file mode 100644 index 000000000..7dc58dfe7 --- /dev/null +++ b/docs/Welcome/Changelog.rst @@ -0,0 +1,4 @@ +Changelog +========= + +TBD \ No newline at end of file diff --git a/docs/Welcome/Features.rst b/docs/Welcome/Features.rst new file mode 100644 index 000000000..793bdbe6d --- /dev/null +++ b/docs/Welcome/Features.rst @@ -0,0 +1,54 @@ +Features +======== + +* Non-intrusive instrumentation +* Validation Layers +* Automatic batch resizing +* Docker +* Hardware + * ROCm 5.7 + * NVIDIA +* Metrics gathering + * Performance throughput + * GPU util + * CPU util + * IO util + + +Benchmarks +---------- + +.. 
code-block:: text + +--------------------------+-----------+-----------+-------------+-----------+-------------------+ + | Benchmark | Unit | Domain | Network | Focus | Task | + +==========================+===========+===========+=============+===========+===================+ + | bf16 | TFlops | Synthetic | | Training | | + | fp16 | TFlops | Synthetic | | Training | | + | tf32 | TFlops | Synthetic | | Training | | + | fp32 | TFlops | Synthetic | | Training | | + | bert-fp16 | | NLP | Transformer | Training | Language Modeling | + | bert-fp32 | | NLP | Transformer | Training | Language Modeling | + | bert-tf32 | | NLP | Transformer | Training | Language Modeling | + | bert-tf32-fp16 | | NLP | Transformer | Training | Language Modeling | + | opt-1_3b | | NLP | Transformer | Training | Language Modeling | + | opt-6_7b | | NLP | Transformer | Training | Language Modeling | + | reformer | | NLP | Transformer | Training | Language Modeling | + | rwkv | | NLP | RNN | Training | Language Modeling | + | llama | Token/sec | NLP | Transformer | Inference | Generation | + | dlrm | | NLP | | Training | Recommendation | + | convnext_large-fp16 | img/sec | Vision | Convolution | Training | Classification | + | convnext_large-fp32 | img/sec | Vision | Convolution | Training | Classification | + | convnext_large-tf32 | img/sec | Vision | Convolution | Training | Classification | + | convnext_large-tf32-fp16 | img/sec | Vision | Convolution | Training | Classification | + | davit_large | img/sec | Vision | Transformer | Training | Classification | + | focalnet | | Vision | Convolution | Training | Classification | + | davit_large-multi | img/sec | Vision | Transformer | Training | Classification | + | regnet_y_128gf | img/sec | Vision | Convolution | Training | Classification | + | resnet152 | img/sec | Vision | Convolution | Training | Classification | + | resnet152-multi | img/sec | Vision | Convolution | Training | Classification | + | resnet50 | img/sec | Vision | Convolution | 
Training | Classification | + | stargan | img/sec | Vision | Convolution | Training | GAN | + | super-slomo | img/sec | Vision | Convolution | Training | | + | t5 | | NLP | Transformer | Training | | + | whisper | | Audio | | Training | | + +--------------------------+-----------+-----------+-------------+-----------+-------------------+ \ No newline at end of file diff --git a/docs/Welcome/Roadmap.rst b/docs/Welcome/Roadmap.rst new file mode 100644 index 000000000..bfc05518f --- /dev/null +++ b/docs/Welcome/Roadmap.rst @@ -0,0 +1,10 @@ +Roadmap +======= + +* Cloud CI +* ROCm 6.0 - MI300 support +* GPU Max Series - 1550 support +* Evaluate suitability + * Tenstorrent + * Graphcore + * Cerebras \ No newline at end of file diff --git a/docs/execution_modes.rst b/docs/execution_modes.rst deleted file mode 100644 index 8d40fc44d..000000000 --- a/docs/execution_modes.rst +++ /dev/null @@ -1,93 +0,0 @@ -Milabench processes overview -============================ - -* milabench main process - * gather metrics from benchmark processes, save them to file - * manages the benchmarks (timeout etc...) - - * if ``per_gpu`` is used, milabench will launch one process per GPU (sets ``CUDA_VISIBLE_DEVCES``) - * each processes log their GPU data - * might spawn a monitor process - * will init pynvml - * dataloader will also spawn process workers - * usually not using GPU - - * if ``njobs`` is used, milabench will launch a single process (torchrun) - * torchrun in turn will spawn one process per GPU - * RANK 0 is used for logging - * RANK 0 might spawn a monitor process - * will init pynvml - * dataloader will also spawn process workers - * usually not using GPU - -Plan ----- - -per_gpu -+++++++ - -``per_gpu``: used for mono gpu benchmarks, spawn one process per gpu and run the same benchmark - -.. 
code-block:: yaml - - _torchvision: - inherits: _defaults - definition: ../benchmarks/torchvision - group: torchvision - install_group: torch - plan: - method: per_gpu - -Milabench will essentially execute something akin to below. - -.. code-block:: bash - - echo "---" - echo "fp16" - echo "====" - time ( - CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 & - wait - ) - -njobs -+++++ - -``njobs`` used to launch a single jobs that can see all the gpus. - -.. 
code-block:: yaml - - _torchvision_ddp: - inherits: _defaults - definition: ../benchmarks/torchvision_ddp - group: torchvision - install_group: torch - plan: - method: njobs - n: 1 - -Milabench will essentially execute something akin to below. - -.. code-block:: bash - - echo "---" - echo "lightning-gpus" - echo "==============" - time ( - $BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 & - wait - ) - - - - - - - diff --git a/docs/index.rst b/docs/index.rst index ebbb27383..b4decf4d7 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -2,18 +2,41 @@ Welcome to milabench's documentation! ===================================== + +.. toctree:: + :caption: News + :maxdepth: 1 + + Welcome/Features + Welcome/Roadmap + Welcome/Changelog + + .. toctree:: :maxdepth: 2 - :caption: Contents: + :caption: Getting Started + + GettingStarted/usage.rst + GettingStarted/docker.rst + - usage.rst - recipes.rst - new_benchmarks.rst +.. toctree:: + :caption: Contributing + :maxdepth: 1 + + Contributing/overview + Contributing/new_benchmarks + Contributing/sizer + Contributing/dev-usage + Contributing/design + Contributing/recipes + + +.. toctree:: + :caption: API + :maxdepth: 1 - docker.rst - dev-usage.rst - reference.rst - sizer.rst + ref-pack.rst Indices and tables