I ran some benchmark tests on H200 at bf16 and fp8 precision.
I found that in the forward pass, H200 is slightly slower than H100 (4% on average), while in the backward pass it is slightly faster (3.5% on average).
Is the slower forward pass expected, given that H200 is a higher-end GPU than H100? Do we need something like an "FA3.5" that adapts to and exploits the H200?
What TFLOPS do you get? Which version of the code (e.g. which commit) did you use?
I'm having some difficulty using ncu to compute the FLOPs right now, but I have the runtime results:
With (batch_size, num_heads, seq_len, dim) = (1, 8, 32768, 64):
for H100 at forward bf16 task (full attention), I get 4.27728 ms.
for H200 at forward bf16 task (full attention), I get 4.51709 ms.
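ncu isn't strictly needed here: for standard attention the forward-pass FLOP count can be estimated analytically as 4 · batch · heads · seq_len² · dim (two matmuls, Q·Kᵀ and P·V, at 2 · seq_len² · dim FLOPs per head each, halved for causal masking). A minimal sketch converting the runtimes above into achieved TFLOPS, assuming full (non-causal) attention as stated:

```python
def attention_flops(batch, heads, seqlen, dim, causal=False):
    # Two matmuls per head (Q @ K^T and P @ V), each 2 * seqlen^2 * dim FLOPs.
    f = 4 * batch * heads * seqlen * seqlen * dim
    return f // 2 if causal else f

def achieved_tflops(batch, heads, seqlen, dim, runtime_ms):
    # FLOPs divided by runtime in seconds, scaled to TFLOPS.
    return attention_flops(batch, heads, seqlen, dim) / (runtime_ms * 1e-3) / 1e12

# Shapes and runtimes from the numbers above.
print(f"H100: {achieved_tflops(1, 8, 32768, 64, 4.27728):.1f} TFLOPS")  # ~514 TFLOPS
print(f"H200: {achieved_tflops(1, 8, 32768, 64, 4.51709):.1f} TFLOPS")  # ~487 TFLOPS
```

Both figures are a large fraction of the ~989 dense bf16 TFLOPS peak on H100/H200, which suggests this shape is compute-bound rather than bandwidth-bound, so H200's extra HBM bandwidth wouldn't be expected to help the forward pass.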