How to use low bit KV Cache #721
Per-channel quantization is not supported yet; it could be a good feature to have. Added to #675.
I'm looking forward to this feature. In higher-performance engines like LMDeploy, an int8 KV cache achieves both high throughput and high accuracy, better than the current fp8_e5m2 and fp8_e4m3.
You are right, it's a SOTA implementation.
At present, fp8_e5m2 seriously reduces accuracy, and fp8_e4m3 is even slower than BF16.
The current flashinfer fp8 kernel should only be used for decode/append; otherwise it's worse than first converting the data to f16 and then using the f16 kernels, because it internally uses f16 tensor cores. @sitabulaixizawaluduo, can you tell us what your GPU architecture is? My bandwidth is currently limited and I'm prioritizing Hopper/Blackwell.
GPU: L40, Ada architecture.
In addition, LMDeploy's KV cache attention should also be computed with fp16 tensor cores, with a dequantization step in the middle, which is why it can maintain high accuracy.
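A minimal PyTorch sketch of that dequantize-then-fp16-compute path; the int8 layout and the per-head scale tensor here are assumptions for illustration, not LMDeploy's or flashinfer's internal layout:

```python
import torch

# Assumed int8 KV cache slice for one request: [num_kv_heads, seq_len, head_dim],
# with one scale per head (illustrative layout only).
num_kv_heads, seq_len, head_dim = 8, 1024, 128
k_int8 = torch.randint(-128, 127, (num_kv_heads, seq_len, head_dim), dtype=torch.int8)
k_scale = torch.rand(num_kv_heads, 1, 1, dtype=torch.float16)

# Dequantize to fp16 so the attention matmuls can run on fp16 tensor cores --
# the "dequantization step in the middle" described above.
k_fp16 = k_int8.to(torch.float16) * k_scale
```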
Got it. I think the hyperparameters for Ada are not tuned well; I'll fix that.
By the way, did you observe this slowdown for prefill or decode? If you are talking about decode, you need to enable
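A sketch of enabling the tensor-core decode path; whether `use_tensor_cores` is the exact option meant in the (truncated) comment above is an assumption, so please check the flashinfer docs for your version:

```python
import torch
import flashinfer

# 128 MB workspace buffer, as in the flashinfer examples.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Assumption: the option referred to above is `use_tensor_cores`; with an fp8
# KV cache it routes decode through the tensor-core kernels.
decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace, "NHD", use_tensor_cores=True
)
# plan()/run() usage is unchanged from the regular decode workflow.
```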
Thanks! I will try it.
When computing with an fp8 KV cache and tensor cores enabled, does it first compute with fp8 tensor cores and then use sm_scale to dequantize to fp16/bf16, or does it first use sm_scale to dequantize to fp16/bf16 and then compute with fp16 tensor cores?
Hi @sitabulaixizawaluduo, it uses f16 tensor cores for both QK and PV. I'm working on head-wise scales for fp8/int8 KV caches; I wonder whether the following API would work for you:
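(The snippet originally referenced here did not survive in the text. Purely as a placeholder, a hypothetical shape such a head-wise-scale API might take; every name and parameter below, including `k_scale`/`v_scale` being per-head tensors, is an assumption rather than flashinfer's actual interface.)

```python
# Hypothetical sketch only: per-head scales passed alongside an fp8/int8
# paged KV cache. Parameter names are assumptions, not the real API.
out = wrapper.run(
    q,                 # [num_qo_heads, head_dim] fp16 queries
    paged_kv_cache,    # fp8_e4m3 / int8 paged KV cache
    k_scale=k_scale,   # [num_kv_heads] per-head key scales
    v_scale=v_scale,   # [num_kv_heads] per-head value scales
)
```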
I think this is useful. BTW, if online quantization is used, will the overhead it introduces have a significant impact on performance?
You mean online quantization of the KV cache? I think it would be good to provide an API like https://docs.flashinfer.ai/generated/flashinfer.page.append_paged_kv_cache.html, but fused with online quantization.
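A sketch of what the unfused version of that step could look like (per-head symmetric int8 quantization of newly generated K/V tokens before they are appended to the paged cache); the helper below is illustrative, not part of flashinfer:

```python
import torch

def quantize_kv_per_head(k: torch.Tensor, v: torch.Tensor):
    """Per-head symmetric int8 quantization of new K/V tokens.

    k, v: [num_tokens, num_kv_heads, head_dim] in fp16/bf16.
    Returns int8 tensors plus one fp16 scale per head. This is only a sketch
    of the online-quantization step that the comment above proposes fusing
    into the append kernel.
    """
    k_scale = k.abs().amax(dim=(0, 2)).clamp(min=1e-6) / 127.0  # [num_kv_heads]
    v_scale = v.abs().amax(dim=(0, 2)).clamp(min=1e-6) / 127.0
    k_q = torch.clamp((k / k_scale[None, :, None]).round(), -128, 127).to(torch.int8)
    v_q = torch.clamp((v / v_scale[None, :, None]).round(), -128, 127).to(torch.int8)
    return k_q, v_q, k_scale.half(), v_scale.half()

# The quantized k_q / v_q would then be written into the paged KV cache with
# flashinfer.append_paged_kv_cache (see the linked docs); fusing this
# quantization into that kernel is the idea discussed above.
```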
Does flashinfer currently support a per-head quantized KV cache, including fp8_e4m3 and int8?