Releases: mila-iqia/milabench
v1.0.0
Improvements
- Support for XPU and HPU
- New measure methods based on timing Event
- New compatibility layer for multi-vendor support
- New argument placeholders
  - {arch}: accelerator arch (cuda, xpu, hpu, etc.)
  - {ccl}: collective communication library (nccl, rccl, ccl, hccl, etc.)
  - {cpu_count}: number of CPUs available on the machine
  - {cpu_per_gpu}: number of CPUs available per GPU, cpu_count / device_count
  - {n_worker}: recommended number of workers, min(cpu_per_gpu, 16)
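The two derived placeholders follow directly from the formulas listed above. The sketch below is only an illustration of that arithmetic, not milabench's actual code: the function name `resolve_placeholders`, the use of integer division, and the handling of a machine with zero devices are all assumptions.

```python
def resolve_placeholders(cpu_count, device_count, arch="cuda", ccl="nccl"):
    """Compute placeholder values from the formulas in the release notes.

    Hypothetical helper for illustration; not part of milabench.
    """
    # Assumption: treat a machine without accelerators as having one device
    # so the division below is always defined.
    devices = max(device_count, 1)

    # {cpu_per_gpu} = cpu_count / device_count (integer division is an assumption)
    cpu_per_gpu = cpu_count // devices

    # {n_worker} = min(cpu_per_gpu, 16)
    n_worker = min(cpu_per_gpu, 16)

    return {
        "arch": arch,
        "ccl": ccl,
        "cpu_count": cpu_count,
        "cpu_per_gpu": cpu_per_gpu,
        "n_worker": n_worker,
    }


# Values taken from the reference system in these notes: n_cpu 128, n_gpu 8.
values = resolve_placeholders(cpu_count=128, device_count=8)
print(values["cpu_per_gpu"], values["n_worker"])
```

With 128 CPUs and 8 GPUs, both `cpu_per_gpu` and `n_worker` come out to 16. A machine with more CPUs per GPU would still be capped at 16 workers.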
New Benchmarks
- RL
- brax (jax)
- dqn (jax)
- ppo (jax)
- torchatari (torch)
- Graph (torch geometric)
- dimenet
- recursiongfn
- Vision
- diffusion
- lightning
- dinov2 - vision transformer
- jepa - video
- llm
- llm-lora-single: lora (llama3.1 8B)
- llm-lora-ddp-gpus: ddp + lora (llama3.1 8B)
- llm-lora-ddp-nodes: multi nodes + ddp + lora (llama3.1 8B)
- llm-lora-mp-gpus: mp + lora (llama3.1 70B)
- llm-full-mp-gpus: mp + full (llama3.1 70B)
- llm-full-mp-nodes: multi nodes + mp + full (llama3.1 70B)
- rlhf monogpu
- rlhf multi gpu
- llava
Reference Run - Single Node
=================
Benchmark results
=================
System
------
cpu: AMD EPYC 7742 64-Core Processor
n_cpu: 128
product: NVIDIA A100-SXM4-80GB
n_gpu: 8
memory: 81920.0
Breakdown
---------
bench | fail | n | ngpu | perf | sem% | std% | peak_memory | score | weight
brax | 0 | 1 | 8 | 730035.71 | 0.1% | 0.4% | 2670 | 730035.71 | 1.00
diffusion-gpus | 0 | 1 | 8 | 117.67 | 1.5% | 11.7% | 59944 | 117.67 | 1.00
diffusion-single | 0 | 8 | 1 | 25.02 | 0.8% | 17.9% | 53994 | 202.10 | 1.00
dimenet | 0 | 8 | 1 | 366.85 | 0.7% | 16.2% | 2302 | 2973.32 | 1.00
dinov2-giant-gpus | 0 | 1 | 8 | 445.68 | 0.4% | 3.0% | 69614 | 445.68 | 1.00
dinov2-giant-single | 0 | 8 | 1 | 53.54 | 0.4% | 9.5% | 74646 | 432.65 | 1.00
dqn | 0 | 8 | 1 | 23089954554.91 | 1.1% | 89.9% | 62106 | 184480810548.20 | 1.00
bf16 | 0 | 8 | 1 | 293.43 | 0.2% | 6.3% | 1788 | 2361.16 | 0.00
fp16 | 0 | 8 | 1 | 289.26 | 0.1% | 3.6% | 1788 | 2321.65 | 0.00
fp32 | 0 | 8 | 1 | 19.14 | 0.0% | 0.7% | 2166 | 153.21 | 0.00
tf32 | 0 | 8 | 1 | 146.63 | 0.1% | 3.6% | 2166 | 1177.04 | 0.00
bert-fp16 | 0 | 8 | 1 | 263.73 | 1.1% | 16.7% | nan | 2165.37 | 0.00
bert-fp32 | 0 | 8 | 1 | 44.84 | 0.6% | 9.6% | 21170 | 364.52 | 0.00
bert-tf32 | 0 | 8 | 1 | 141.95 | 0.9% | 14.1% | 1764 | 1162.94 | 0.00
bert-tf32-fp16 | 0 | 8 | 1 | 265.04 | 1.0% | 15.6% | nan | 2175.59 | 3.00
reformer | 0 | 8 | 1 | 62.29 | 0.3% | 6.0% | 25404 | 501.89 | 1.00
t5 | 0 | 8 | 1 | 51.40 | 0.5% | 9.9% | 34390 | 416.14 | 2.00
whisper | 0 | 8 | 1 | 481.95 | 1.0% | 21.4% | 8520 | 3897.53 | 1.00
lightning | 0 | 8 | 1 | 680.22 | 1.0% | 22.7% | 27360 | 5506.90 | 1.00
lightning-gpus | 0 | 1 | 8 | 3504.74 | 7.9% | 62.9% | 28184 | 3504.74 | 1.00
llava-single | 1 | 8 | 1 | 2.28 | 0.4% | 9.6% | 72556 | 14.12 | 1.00
llama | 0 | 8 | 1 | 484.86 | 4.4% | 80.0% | 27820 | 3680.86 | 1.00
llm-full-mp-gpus | 0 | 1 | 8 | 193.92 | 3.1% | 16.2% | 48470 | 193.92 | 1.00
llm-lora-ddp-gpus | 0 | 1 | 8 | 16738.58 | 0.4% | 2.0% | 36988 | 16738.58 | 1.00
llm-lora-mp-gpus | 0 | 1 | 8 | 1980.63 | 2.2% | 11.8% | 55972 | 1980.63 | 1.00
llm-lora-single | 0 | 8 | 1 | 2724.95 | 0.2% | 3.0% | 49926 | 21861.99 | 1.00
ppo | 0 | 8 | 1 | 3114264.32 | 1.6% | 57.2% | 62206 | 24915954.98 | 1.00
recursiongfn | 0 | 8 | 1 | 7080.67 | 1.2% | 27.1% | 10292 | 57038.34 | 1.00
rlhf-gpus | 0 | 1 | 8 | 6314.94 | 2.1% | 11.2% | 21730 | 6314.94 | 1.00
rlhf-single | 0 | 8 | 1 | 1143.72 | 0.4% | 8.4% | 19566 | 9174.52 | 1.00
focalnet | 0 | 8 | 1 | 375.07 | 0.7% | 14.9% | 23536 | 3038.83 | 2.00
torchatari | 0 | 8 | 1 | 5848.88 | 0.6% | 12.7% | 3834 | 46613.34 | 1.00
convnext_large-fp16 | 0 | 8 | 1 | 330.93 | 1.5% | 22.9% | 27376 | 2711.46 | 0.00
convnext_large-fp32 | 0 | 8 | 1 | 59.49 | 0.6% | 9.8% | 55950 | 483.84 | 0.00
convnext_large-tf32 | 0 | 8 | 1 | 155.41 | 0.9% | 14.3% | 49650 | 1273.31 | 0.00
convnext_large-tf32-fp16 | 0 | 8 | 1 | 322.28 | 1.6% | 24.5% | 27376 | 2637.88 | 3.00
regnet_y_128gf | 0 | 8 | 1 | 119.46 | 0.5% | 10.0% | 29762 | 966.96 | 2.00
resnet152-ddp-gpus | 0 | 1 | 8 | 3843.06 | 5.2% | 39.3% | 27980 | 3843.06 | 0.00
resnet50 | 0 | 8 | 1 | 932.95 | 2.4% | 52.2% | 14848 | 7524.25 | 1.00
resnet50-noio | 0 | 8 | 1 | 1163.88 | 0.3% | 6.7% | 27480 | 9385.35 | 0.00
vjepa-gpus | 0 | 1 | 8 | 130.13 | 5.9% | 46.8% | 64244 | 130.13 | 1.00
vjepa-single | 0 | 8 | 1 | 21.29 | 1.0% | 22.4% | 58552 | 172.11 | 1.00
Scores
------
Failure rate: 0.38% (PASS)
Score: 4175.57
Errors
------
1 errors, details in HTML report.
What's Changed
- Improve exception parsing by @Delaunay in #222
- enable long trace by default by @Delaunay in #223
- Live report by @Delaunay in #146
- Do NOT run pretrained llama by @Delaunay in #227
- Add worker resolution by @Delaunay in #225
- Update observer.py by @Delaunay in #230
- Phase lock by @Delaunay in #228
- Multi node check by @Delaunay in #234
- update templates by @Delaunay in #235
- New lightning bench by @Delaunay in #236
- Update scaling.yaml by @Delaunay in #229
- Dino by @Delaunay in #238
- Update recipes.rst by @Delaunay in #242
- Llama 3 by @Delaunay in #240
- Initial commit Torch_PPO_Cleanrl_Atari_Envpool by @roger-creus in #243
- recursiongfn benchmark by @josephdviviano in #249
- Multi node tweaks by @Delaunay in #248
- Create execution_modes.rst by @Delaunay in #241
- Add Dimenet by @Delaunay in #251
- Rlhf 2 by @Delaunay in #253
- Benchmark Batch by @Delaunay in #252
- Update pins for CUDA by @Delaunay in #259
- Rl argparse by @Delaunay in #264
- Cleanrl jax by @Delaunay in #263
- Tweaks 3 by @Delaunay in #261
- Staging by @Delaunay in #265
- Adding LlaVa by @rabiulcste in #266
- Fix diffusion by @satyaog in #267
- Attempt fix on dinov2-giant-nodes by @satyaog in #268
- Generate llama instead of downloading it by @satyaog in #250
- Staging by @Delaunay in #269
- Update pins by @Delaunay in #272
- new RLHF benchmark by @Delaunay in #273
- Rlhf hf by @Delaunay in #275
- Fixes loss NaN issue for LlaVa by @rabiulcste in #279
- Geo gnn fixes by @bouthilx in #284
- Staging by @Delaunay in #283
- Sync Stable with mas...
v1.0.0 RC1
Improvements
- Support for XPU and HPU
- New measure methods based on timing Event
- New compatibility layer for multi-vendor support
- New argument placeholders
  - {arch}: accelerator arch (cuda, xpu, hpu, etc.)
  - {ccl}: collective communication library (nccl, rccl, ccl, hccl, etc.)
  - {cpu_count}: number of CPUs available on the machine
  - {cpu_per_gpu}: number of CPUs available per GPU, cpu_count / device_count
  - {n_worker}: recommended number of workers, min(cpu_per_gpu, 16)
New Benchmarks
- RL
- brax (jax)
- dqn (jax)
- ppo (jax)
- torchatari (torch)
- Graph (torch geometric)
- dimenet
- recursiongfn
- Vision
- diffusion
- lightning
- dinov2 - vision transformer
- jepa - video
- llm
- llm-lora-single: lora (llama3.1 8B)
- llm-lora-ddp-gpus: ddp + lora (llama3.1 8B)
- llm-lora-ddp-nodes: multi nodes + ddp + lora (llama3.1 8B)
- llm-lora-mp-gpus: mp + lora (llama3.1 70B)
- llm-full-mp-gpus: mp + full (llama3.1 70B)
- llm-full-mp-nodes: multi nodes + mp + full (llama3.1 70B)
- rlhf monogpu
- rlhf multi gpu
- llava
What's Changed
- Improve exception parsing by @Delaunay in #222
- enable long trace by default by @Delaunay in #223
- Live report by @Delaunay in #146
- Do NOT run pretrained llama by @Delaunay in #227
- Add worker resolution by @Delaunay in #225
- Update observer.py by @Delaunay in #230
- Phase lock by @Delaunay in #228
- Multi node check by @Delaunay in #234
- update templates by @Delaunay in #235
- New lightning bench by @Delaunay in #236
- Update scaling.yaml by @Delaunay in #229
- Dino by @Delaunay in #238
- Update recipes.rst by @Delaunay in #242
- Llama 3 by @Delaunay in #240
- Initial commit Torch_PPO_Cleanrl_Atari_Envpool by @roger-creus in #243
- recursiongfn benchmark by @josephdviviano in #249
- Multi node tweaks by @Delaunay in #248
- Create execution_modes.rst by @Delaunay in #241
- Add Dimenet by @Delaunay in #251
- Benchmark Batch by @Delaunay in #252
- Update pins for CUDA by @Delaunay in #259
- Rl argparse by @Delaunay in #264
- Cleanrl jax by @Delaunay in #263
- Tweaks 3 by @Delaunay in #261
- Adding LlaVa by @rabiulcste in #266
- Fix diffusion by @satyaog in #267
- Generate llama instead of downloading it by @satyaog in #250
- Update pins by @Delaunay in #272
- new RLHF benchmark by @Delaunay in #273
- Fixes loss NaN issue for LlaVa by @rabiulcste in #279
- Geo gnn fixes by @bouthilx in #284
- Sync Stable with master by @Delaunay in #143
New Contributors
- @roger-creus made their first contribution in #243
- @josephdviviano made their first contribution in #249
- @rabiulcste made their first contribution in #266
Reference Run - Two Node
Reference Run - Single Node
=================
Benchmark results
=================
System
------
cpu: AMD EPYC 7742 64-Core Processor
n_cpu: 128
product: NVIDIA A100-SXM4-80GB
n_gpu: 8
memory: 81920.0
Breakdown
---------
bench | fail | n | ngpu | perf | sem% | std% | peak_memory | score | weight
brax | 0 | 1 | 8 | 730035.71 | 0.1% | 0.4% | 2670 | 730035.71 | 1.00
diffusion-gpus | 0 | 1 | 8 | 117.67 | 1.5% | 11.7% | 59944 | 117.67 | 1.00
diffusion-single | 0 | 8 | 1 | 25.02 | 0.8% | 17.9% | 53994 | 202.10 | 1.00
dimenet | 0 | 8 | 1 | 366.85 | 0.7% | 16.2% | 2302 | 2973.32 | 1.00
dinov2-giant-gpus | 0 | 1 | 8 | 445.68 | 0.4% | 3.0% | 69614 | 445.68 | 1.00
dinov2-giant-single | 0 | 8 | 1 | 53.54 | 0.4% | 9.5% | 74646 | 432.65 | 1.00
dqn | 0 | 8 | 1 | 23089954554.91 | 1.1% | 89.9% | 62106 | 184480810548.20 | 1.00
bf16 | 0 | 8 | 1 | 293.43 | 0.2% | 6.3% | 1788 | 2361.16 | 0.00
fp16 | 0 | 8 | 1 | 289.26 | 0.1% | 3.6% | 1788 | 2321.65 | 0.00
fp32 | 0 | 8 | 1 | 19.14 | 0.0% | 0.7% | 2166 | 153.21 | 0.00
tf32 | 0 | 8 | 1 | 146.63 | 0.1% | 3.6% | 2166 | 1177.04 | 0.00
bert-fp16 | 0 | 8 | 1 | 263.73 | 1.1% | 16.7% | nan | 2165.37 | 0.00
bert-fp32 | 0 | 8 | 1 | 44.84 | 0.6% | 9.6% | 21170 | 364.52 | 0.00
bert-tf32 | 0 | 8 | 1 | 141.95 | 0.9% | 14.1% | 1764 | 1162.94 | 0.00
bert-tf32-fp16 | 0 | 8 | 1 | 265.04 | 1.0% | 15.6% | nan | 2175.59 | 3.00
reformer | 0 | 8 | 1 | 62.29 | 0.3% | 6.0% | 25404 | 501.89 | 1.00
t5 | 0 | 8 | 1 | 51.40 | 0.5% | 9.9% | 34390 | 416.14 | 2.00
whisper | 0 | 8 | 1 | 481.95 | 1.0% | 21.4% | 8520 | 3897.53 | 1.00
lightning | 0 | 8 | 1 | 680.22 | 1.0% | 22.7% | 27360 | 5506.90 | 1.00
lightning-gpus | 0 | 1 | 8 | 3504.74 | 7.9% | 62.9% | 28184 | 3504.74 | 1.00
llava-single | 1 | 8 | 1 | 2.28 | 0.4% | 9.6% | 72556 | 14.12 | 1.00
llama | 0 | 8 | 1 | 484.86 | 4.4% | 80.0% | 27820 | 3680.86 | 1.00
llm-full-mp-gpus | 0 | 1 | 8 | 193.92 | 3.1% | 16.2% | 48470 | 193.92 | 1.00
llm-lora-ddp-gpus | 0 | 1 | 8 | 16738.58 | 0.4% | 2.0% | 36988 | 16738.58 | 1.00
llm-lora-mp-gpus | 0 | 1 | 8 | 1980.63 | 2.2% | 11.8% | 55972 | 1980.63 | 1.00
llm-lora-single | 0 | 8 | 1 | 2724.95 | 0.2% | 3.0% | 49926 | 21861.99 | 1.00
ppo | 0 | 8 | 1 | 3114264.32 | 1.6% | 57.2% | 62206 | 24915954.98 | 1.00
recursiongfn | 0 | 8 | 1 | 7080.67 | 1.2% | 27.1% | 10292 | 57038.34 | 1.00
rlhf-gpus | 0 | 1 | 8 | 6314.94 | 2.1% | 11.2% | 21730 | 6314.94 | 1.00
rlhf-single | 0 | 8 | 1 | 1143.72 | 0.4% | 8.4% | 19566 | 9174.52 | 1.00
focalnet | 0 | 8 | 1 | 375.07 | 0.7% | 14.9% | 23536 | 3038.83 | 2.00
torchatari | 0 | 8 | 1 | 5848.88 | 0.6% | 12.7% | 3834 | 46613.34 | 1.00
convnext_large-fp16 | 0 | 8 | 1 | 330.93 | 1.5% | 22.9% | 27376 | 2711.46 | 0.00
convnext_large-fp32 | 0 | 8 | 1 | 59.49 | 0.6% | 9.8% | 55950 | 483.84 | 0.00
convnext_large-tf32 | 0 | 8 | 1 | 155.41 | 0.9% | 14.3% | 49650 | 1273.31 | 0.00
convnext_large-tf32-fp16 | 0 | 8 | 1 | 322.28 | 1.6% | 24.5% | 27376 | 2637.88 | 3.00
regnet_y_128gf | 0 | 8 | 1 | 119.46 | 0.5% | 10.0% | 29762 | 966.96 | 2.00
resnet152-ddp-gpus | 0 | 1 | 8 | 3843.06 | 5.2% | 39.3% | 27980 | 3843.06 | 0.00
resnet50 | 0 | 8 | 1 | 932.95 | 2.4% | 52.2% | 14848 | 7524.25 | 1.00
resnet50-noio | 0 | 8 | 1 | 1163.88 | 0.3% | 6.7% | 27480 | 9385.35 | 0.00
vjepa-gpus | 0 | 1 | 8 | 130.13 | 5.9% | 46.8% | 64244 | 130.13 | 1.00
vjepa-single | 0 | 8 | 1 | 21.29 | 1.0% | 22.4% | 58552 | 172.11 | 1.00
Scores
------
Failure rate: 0.38% (PASS)
Score: 4175.57
Errors
------
1 errors, details in HTML report.
Full Changelog: https://github.com/mila-iqia/mila...
v0.1.0
What's Changed
- Update ROCm docker command by @Delaunay in #124
- Validation Layers by @Delaunay in #56
- Update voir to 2.15 by @Delaunay in #135
- Make requirement updates more stable by @breuleux in #133
- Fix instability in Whisper benchmark by @breuleux in #137
- Disable ROCm tests while we find a suitable machine replacement by @Delaunay in #140
- Add system config by @satyaog in #130
- Add metadata gathering by @Delaunay in #132
- New Metric Persistence Backend by @Delaunay in #58
- Add RWKV benchmark by @breuleux in #90
- Tweak Performance computation by @Delaunay in #144
- Add execution plan abstraction by @satyaog in #145
- Make error validation work with python exception by @Delaunay in #154
- Multi node install & prepare by @Delaunay in #153
- Docker tweaks by @Delaunay in #160
- Fix #164: makes sure all the timeout tasks are cancelled by @Delaunay in #165
- Prevent machine_metadata from throwing by @Delaunay in #163
- remove dlrm profiling by @Delaunay in #168
- Add flops benchmark by @Delaunay in #169
- Add new inference bench by @Delaunay in #174
- Autoscale by @Delaunay in #177
- Fix ${{}} in runner by @Delaunay in #175
- Use black by @Delaunay in #178
- Track pytorch version by @Delaunay in #155
- Deploy script by @Delaunay in #182
- Add dataset revision by @Delaunay in #187
- Update README.md by @Delaunay in #180
- Use node["port"] to ssh to the node by @Delaunay in #189
- Build docker container for reporting by @Delaunay in #197
- Add missing property by @Delaunay in #198
- Add git to docker by @Delaunay in #199
- Simplify name by @Delaunay in #200
- Make sure report works without GPU by @Delaunay in #201
- Tag report containers by @Delaunay in #202
- Update README.md by @Delaunay in #209
- Intel GPU Max Support + Gaudi by @Delaunay in #214
- Add Benchmate: benchmark companion lib
- Support for XPU and HPU
- New measure methods based on timing Event
- New compatibility layer for multi vendor support
- New argument placeholders
  - {arch}: accelerator arch (cuda, xpu, hpu, etc.)
  - {ccl}: collective communication library (nccl, rccl, ccl, hccl, etc.)
  - {cpu_count}: number of CPUs available on the machine
  - {cpu_per_gpu}: number of CPUs available per GPU, cpu_count / device_count
  - {n_worker}: recommended number of workers, min(cpu_per_gpu, 16)
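As a rough picture of how such placeholders might be used, the sketch below expands them inside a benchmark's argument list with Python's built-in `str.format`. This is an assumption about the mechanism, not milabench's implementation. The helper `expand_args` and the flags `--device`, `--backend`, and `--num-workers` are hypothetical names chosen for illustration.

```python
def expand_args(argv_template, values):
    """Substitute {placeholder} fields in each argument string.

    Hypothetical helper; uses plain str.format, so an unknown
    placeholder name raises KeyError.
    """
    return [arg.format(**values) for arg in argv_template]


# Example values for a CUDA machine with 128 CPUs and 8 GPUs.
values = {
    "arch": "cuda",
    "ccl": "nccl",
    "cpu_count": 128,
    "cpu_per_gpu": 16,
    "n_worker": 16,
}

# Hypothetical argument template; the flag names are not taken from milabench.
template = ["--device", "{arch}", "--backend", "{ccl}", "--num-workers", "{n_worker}"]

expanded = expand_args(template, values)
print(expanded)
```

Keeping the vendor-specific values in placeholders lets the same argument list work for CUDA, XPU, or HPU machines without editing each benchmark definition.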
Full Changelog: v0.0.6...v0.1.0
v0.0.10
v0.0.9
v0.0.8
v0.0.7
v0.0.6
Update Monitor call (#123) Co-authored-by: Pierre Delaunay