
Does DLRM_v2 support H100? #635

Closed
xyyintel opened this issue Apr 11, 2023 · 3 comments

Comments

@xyyintel

Does DLRM_v2 support H100? If it is supported, what environment did you use?
I have tried CUDA 11.8 with PyTorch 1.14.0 or PyTorch 2.1, torchrec 0.3.2 or 0.4.0, and fbgemm_gpu 0.3.2 or 0.4.1.
However, none of the above environments works.

@erichan1

I don't think we ever got to test this on H100. cc @janekl in case you've tried it on H100.

@janekl
Contributor

janekl commented May 4, 2023

Right, the development and testing involved only A100.

To achieve this you would at least need CUDA 12 and to compile FBGEMM for the Hopper architecture (SM90). But I have never tried this myself.
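For reference, a minimal sketch of what such a source build might look like. This assumes a CUDA 12 toolkit and a matching PyTorch build are already installed; `TORCH_CUDA_ARCH_LIST` is the standard PyTorch/CMake convention for selecting target architectures, but the exact fbgemm_gpu build steps and flags vary between versions, so check the FBGEMM repository's own build instructions.

```shell
# Sketch: build fbgemm_gpu from source targeting Hopper (SM90).
# Assumes CUDA 12 and a compatible PyTorch are installed; exact steps
# may differ by fbgemm_gpu version -- consult the repo's docs.
git clone --recursive https://github.com/pytorch/FBGEMM.git
cd FBGEMM/fbgemm_gpu

# "9.0" is the compute capability of Hopper GPUs (SM90).
export TORCH_CUDA_ARCH_LIST="9.0"

python setup.py install
```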

@ShriyaPalsamudram
Contributor

Closing, as the reference was not tested on H100s.
Note that there were multiple H100 DLRMv2 submissions in the MLPerf Training v4.0 round, as shown in the results table.

Training v4.0 implementations are in this repo
