
FA3 forward performance regression on H200 #1438

Open
complexfilter opened this issue Jan 10, 2025 · 3 comments


@complexfilter

I ran some benchmark tests on H200 at bf16 and fp8 precision, comparing against H100.

I found that in the forward pass, H200 is slightly slower than H100 (4% on average), while in the backward pass it is slightly faster (3.5% on average).

I was wondering if the slower forward pass is expected, given that H200 is the higher-end part. Do we need something like an FA3.5 that adapts to and exploits H200?
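For reference, a minimal timing sketch of the kind of forward-pass measurement described above, assuming the hopper/ build of flash-attention is installed and exposes flash_attn_interface.flash_attn_func (the exact return signature varies by commit). This is an illustrative harness, not the script used for the numbers above:

```python
import torch
from flash_attn_interface import flash_attn_func  # FA3 (hopper/) interface; assumed import path

def time_forward_ms(batch, heads, seqlen, dim, dtype=torch.bfloat16, iters=30):
    """Average forward-pass latency in milliseconds for full (non-causal) attention."""
    def rand():
        return torch.randn(batch, seqlen, heads, dim, device="cuda", dtype=dtype)
    q, k, v = rand(), rand(), rand()

    # Warm up so one-time setup costs do not skew the measurement.
    for _ in range(5):
        flash_attn_func(q, k, v, causal=False)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        flash_attn_func(q, k, v, causal=False)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    # Shape matching the case discussed later in this thread.
    print(f"{time_forward_ms(1, 8, 32768, 64):.3f} ms per forward pass")
```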

@tridao
Contributor

tridao commented Jan 11, 2025

What TFLOPS do you get?
Which version of the code (e.g. which commit) did you use?

@complexfilter
Author

What TFLOPS do you get? Which version of the code (e.g. which commit) did you use?

I'm having some difficulty using ncu to compute the FLOPS right now, but I have the runtime results.
When (batch_size, num_heads, seq_len, dim) = (1, 8, 32768, 64):

  • for H100 on the forward bf16 task (full attention), I get 4.27728 ms.
  • for H200 on the forward bf16 task (full attention), I get 4.51709 ms.

I used commit: 3cea2fb.
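For a rough TFLOPS figure (in lieu of ncu), the runtimes above can be converted using the standard FLOP count for a non-causal attention forward pass, 4 * batch * heads * seqlen^2 * dim. A small sketch of that arithmetic, with the resulting numbers being illustrative estimates rather than profiler measurements:

```python
def forward_tflops(batch, heads, seqlen, dim, runtime_ms):
    # 2*B*H*S^2*D for Q @ K^T plus 2*B*H*S^2*D for P @ V (non-causal forward).
    flops = 4 * batch * heads * seqlen**2 * dim
    return flops / (runtime_ms * 1e-3) / 1e12

print(forward_tflops(1, 8, 32768, 64, 4.27728))  # H100: ~514 TFLOPS
print(forward_tflops(1, 8, 32768, 64, 4.51709))  # H200: ~487 TFLOPS
```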

@tridao
Contributor

tridao commented Jan 11, 2025

Can you try the latest commit?
