
Does DLRM_v2 support H100? #635

Closed
xyyintel opened this issue Apr 11, 2023 · 3 comments

Comments

@xyyintel

Does DLRM_v2 support H100? If it is supported, what environment did you use?
I have tried CUDA 11.8 with PyTorch 1.14.0 or PyTorch 2.1, torchrec 0.3.2 or 0.4.0, and fbgemm_gpu 0.3.2 or 0.4.1.
However, none of the above environments works.

@erichan1

I don't think we ever got to test this on H100. cc @janekl in case you've tried it on H100.

@janekl
Contributor

janekl commented May 4, 2023

Right, the development and testing involved only A100.

To achieve this you would at least need CUDA 12 and to compile FBGEMM for the Hopper architecture (SM90). But I have never tried this myself.
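For reference, a minimal sketch of what such a source build might look like. This assumes a CUDA 12 toolkit and a matching PyTorch build are already installed; `TORCH_CUDA_ARCH_LIST` is the standard PyTorch/CMake convention for selecting target architectures, but the exact fbgemm_gpu build steps and flags vary between versions, so check the FBGEMM repository's own build instructions.

```shell
# Sketch: build fbgemm_gpu from source targeting Hopper (SM90).
# Assumes CUDA 12 and a compatible PyTorch are installed; exact steps
# may differ by fbgemm_gpu version -- consult the repo's docs.
git clone --recursive https://github.com/pytorch/FBGEMM.git
cd FBGEMM/fbgemm_gpu

# "9.0" is the compute capability of Hopper GPUs (SM90).
export TORCH_CUDA_ARCH_LIST="9.0"

python setup.py install
```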

@ShriyaPalsamudram
Contributor

Closing, as the reference was not tested on H100s.
Note that there were multiple H100 DLRMv2 submissions in the MLPerf Training v4.0 round, as shown in the results table.

Training v4.0 implementations are in this repo
