I ran some benchmark tests on H200 at bf16 and fp8 precision.
I found that in the forward pass, H200 is slightly slower than H100 (4% on average), while in the backward pass it is slightly faster (3.5% on average).
Is the slower forward pass expected, given that H200 is a higher-end GPU than H100? Do we need something like an "FA3.5" that adapts to and exploits the H200?
What TFLOPS do you get? Which version of the code (e.g. which commit) did you use?
I'm having some difficulty using ncu to compute the FLOPs right now, but I have the runtime results:
With (batch_size, num_heads, seq_len, dim) = (1, 8, 32768, 64):
for H100 at forward bf16 task (full attention), I get 4.27728 ms.
for H200 at forward bf16 task (full attention), I get 4.51709 ms.
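ncu isn't strictly needed here: for standard attention the forward-pass FLOP count can be estimated analytically as 4 · batch · heads · seq_len² · dim (two matmuls, Q·Kᵀ and P·V, at 2 · seq_len² · dim FLOPs per head each, halved for causal masking). A minimal sketch converting the runtimes above into achieved TFLOPS, assuming full (non-causal) attention as stated:

```python
def attention_flops(batch, heads, seqlen, dim, causal=False):
    # Two matmuls per head (Q @ K^T and P @ V), each 2 * seqlen^2 * dim FLOPs.
    f = 4 * batch * heads * seqlen * seqlen * dim
    return f // 2 if causal else f

def achieved_tflops(batch, heads, seqlen, dim, runtime_ms):
    # FLOPs divided by runtime in seconds, scaled to TFLOPS.
    return attention_flops(batch, heads, seqlen, dim) / (runtime_ms * 1e-3) / 1e12

# Shapes and runtimes from the numbers above.
print(f"H100: {achieved_tflops(1, 8, 32768, 64, 4.27728):.1f} TFLOPS")  # ~514 TFLOPS
print(f"H200: {achieved_tflops(1, 8, 32768, 64, 4.51709):.1f} TFLOPS")  # ~487 TFLOPS
```

Both figures are a large fraction of the ~989 dense bf16 TFLOPS peak on H100/H200, which suggests this shape is compute-bound rather than bandwidth-bound, so H200's extra HBM bandwidth wouldn't be expected to help the forward pass.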