@tangcy98 The sm90 version was integrated into BatchPrefillWithPagedKVCacheWrapper but not into BatchDecodeWithPagedKVCacheWrapper, because the minimum wgmma tile size along the query dimension is 64, while the effective query length for decoding (after the head-group fusion described in Appendix 1) is at most num_qo_heads / num_kv_heads, so most of the FLOPs would be wasted. By comparison, the minimum tile size of our fa2 template is 16, so I keep using the fa2 template for decoding. Using the fa3 template might improve performance a little, but I haven't verified that yet.
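To make the tile-size argument concrete, here is a rough sketch of the arithmetic, not FlashInfer code. It assumes that one decode step contributes num_qo_heads / num_kv_heads query rows per KV head after head-group fusion, and that those rows are padded up to a multiple of the minimum tile size. The helper names ceil_div and tile_utilization are made up for illustration, and the head counts are example values only.

```cpp
// Illustrative only: these helpers are hypothetical and not part of FlashInfer.
// Requires C++14 (local variables inside constexpr functions).

// Integer division rounded up.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Fraction of query-tile rows that carry real work for one decode token.
// Assumption: after head-group fusion the query length equals
// num_qo_heads / num_kv_heads, and it is padded to a multiple of min_tile_rows.
constexpr double tile_utilization(int num_qo_heads, int num_kv_heads,
                                  int min_tile_rows) {
  const int group_size = num_qo_heads / num_kv_heads;
  const int padded_rows = ceil_div(group_size, min_tile_rows) * min_tile_rows;
  return static_cast<double>(group_size) / padded_rows;
}

// Example values: 32 query heads sharing 8 KV heads give 4 fused query rows.
static_assert(tile_utilization(32, 8, 64) == 0.0625, "4 of 64 rows used");
static_assert(tile_utilization(32, 8, 16) == 0.25, "4 of 16 rows used");
```

Under these assumptions, a 64-row tile does useful work on about 6% of its rows for this configuration, against 25% for a 16-row tile, which is consistent with the reasoning above for keeping the fa2 template for decoding.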
Hello😀
I am currently implementing batch decode computation using the BatchPrefillHandler and BatchPrefillWithPagedKVCacheWrapper in FlashInfer.
flashinfer/src/flashinfer_ops.cuh
Lines 439 to 445 in a0e99a3
I mainly refer to src/bench_batch_decode.cu.
flashinfer/src/bench_batch_decode.cu
Lines 94 to 160 in a0e99a3
🤔Recently, I noticed that FlashInfer has been updated to support the sm90 Hopper architecture. The BatchPrefillWithPagedKVCacheDispatched (prefill.cuh) called within BatchPrefillWithPagedKVCacheWrapper has a counterpart implementation for sm90 (prefill_sm90.cuh). Although they share the same function name, their parameters and templates are not identical.
See
flashinfer/include/flashinfer/attention/prefill.cuh
Lines 2214 to 2218 in a0e99a3
and
flashinfer/include/flashinfer/attention/hopper/prefill_sm90.cuh
Lines 496 to 499 in a0e99a3
Based on my observations, the sm90 version does not appear to be integrated into BatchPrefillWithPagedKVCacheWrapper, and there are no new tests or benchmarks added for batch decode.
😊I am very interested in knowing whether you have attempted to use BatchPrefillWithPagedKVCacheDispatched for batch decode computation so far. In the future, will it be possible to enable the sm90 implementation without modifying my existing code, or with only minor changes?
Thank you very much for your time and effort. I look forward to your response.
Best regards😇