HMC on A100 spends large amounts of time in memory copy #378
Comments
Further digging shows that at least part of this is user error—I was compiling with |
Some more notes: After the CG, there are two distinct phases visible in the profile (which can be lined up with the log file):
Still to do:
Looking more closely, the wait time appears to be in calls to
As expected this behaves the same in adjoint RHMC as in adjoint HMC.
Still to do—on the first attempt, my laptop couldn't open the |
Since benchmarks show we can get 1.7 TFLOP/s for the Wilson kernel on one A100 but only about 230 GFLOP/s on an AMD Rome node, it would seem reasonable to expect that the HMC should run faster on the former than the latter. However, this isn't what I currently see in production. The time in the CG inversion does go down, as this is done on the GPU, but the time in the momentum update goes up. Profiling this in NVIDIA Nsight Systems shows that the GPU is very well utilised in the CG inversion, but then there is a long period of heavy traffic to and from the device during the momentum update (30% host-to-device, 70% device-to-host). (I see 10–15% of CUDA time being in kernels, and 85–90% in memory operations.)
The tests I've run have been on a 24.24.24.48 lattice, on a single A100 for the GPU tests, and a single 128-core CPU node for the CPU tests, both on Tursa. I've tested SU(3) fundamental, SU(2) fundamental, and SU(2) adjoint, and both RHMC and HMC, and see similar behaviours in all (although I haven't fully profiled every combination).
Is this currently expected behaviour? Is it possible I've made some trivial error in my configuration of Grid, or in how I am running it? (The script I'm using closely follows the one in systems/Tursa.) Or do I need to recalibrate my expectations? (E.g. perhaps DWF has a much greater ratio of CG to momentum updates, so shows a much larger speedup in HMC?)
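To make the "ratio of CG to momentum updates" question concrete, here is a rough Amdahl-style sketch in Python. The only numbers taken from the issue are the two benchmark rates (1.7 TFLOP/s and about 230 GFLOP/s). The function name, the CG time fractions, and the slowdown factor for the non-CG part are hypothetical values chosen for illustration, not measurements. It also assumes, optimistically, that CG wall time scales with the Wilson kernel rate.

```python
def overall_speedup(cg_fraction, cg_speedup, other_speedup=1.0):
    """Amdahl-style estimate of the whole-trajectory speedup.

    cg_fraction   -- fraction of the CPU wall time spent in the CG (hypothetical input)
    cg_speedup    -- factor by which the CG gets faster on the GPU
    other_speedup -- factor for everything else; a value below 1 means it got slower
    """
    if not 0.0 <= cg_fraction <= 1.0:
        raise ValueError("cg_fraction must lie in [0, 1]")
    if cg_speedup <= 0.0 or other_speedup <= 0.0:
        raise ValueError("speedup factors must be positive")
    new_time = cg_fraction / cg_speedup + (1.0 - cg_fraction) / other_speedup
    return 1.0 / new_time


# Ratio of the quoted benchmark rates, both in GFLOP/s.
# Assumption: CG time scales with the Wilson kernel rate.
kernel_ratio = 1700.0 / 230.0

if __name__ == "__main__":
    # Hypothetical CG fractions and a hypothetical 2x slowdown of the rest.
    for cg_fraction in (0.5, 0.9, 0.99):
        same = overall_speedup(cg_fraction, kernel_ratio)
        slower = overall_speedup(cg_fraction, kernel_ratio, other_speedup=0.5)
        print(f"CG fraction {cg_fraction:.2f}: "
              f"rest unchanged -> {same:.2f}x, rest 2x slower -> {slower:.2f}x")
```

Under these assumptions, an action whose trajectory is dominated by the CG would approach the kernel ratio, while one that spends half its time elsewhere can end up slower overall if the non-CG part slows down on the GPU build.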