RoPE problem in MLA decode kernel #730

Open · lsw825 opened this issue Jan 9, 2025 · 2 comments

lsw825 commented Jan 9, 2025

In `flashinfer.decode.BatchDecodeMlaWithPagedKVCacheWrapper`, `rope_theta` and `rope_scaling` are parameters that default to `None`. However, after reading the Triton interface code, I found that `rope_theta` is set to `1e4` when it is `None`. In addition, the scaling appears to be llama-style linear RoPE interpolation rather than the YaRN scaling used in the DeepSeek models.
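For reference, here is a rough sketch (PyTorch; the function name is only illustrative and the interleaved pair layout is an assumption, this is not the actual Triton code) of what I mean by llama-like linear interpolation, where positions are simply divided by the scale factor, whereas YaRN rescales low- and high-frequency dimensions differently:

```python
import torch

def rope_linear_interpolation(x, positions, rope_theta=1e4, rope_scale=1.0):
    # Sketch of llama-style RoPE with linear position interpolation:
    # the only "scaling" is dividing positions by rope_scale.
    # Assumes x: [num_tokens, head_dim] with an interleaved (even/odd) pair layout.
    d = x.shape[-1]
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    t = positions.to(torch.float32) / rope_scale   # linear interpolation step
    freqs = torch.outer(t, inv_freq)               # [num_tokens, d // 2]
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x[..., 0::2].float(), x[..., 1::2].float()
    out = torch.empty_like(x, dtype=torch.float32)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out.to(x.dtype)
```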

I think it would be better if I could disable the RoPE calculation in the MLA decode kernel, apply RoPE before the kernel instead, and store the already-rotated `k_pe` in the KV cache.
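Roughly, the flow I have in mind looks like this (a sketch only; the helper name and tensor layouts are hypothetical, and `apply_rope_fn` would be the model's own YaRN implementation rather than anything from flashinfer):

```python
def append_kpe_with_rope(paged_kpe_cache, slot_indices, k_pe, positions, apply_rope_fn):
    # Hypothetical helper: rotate k_pe with the model's own RoPE variant
    # (e.g. DeepSeek's YaRN) *before* writing it into the paged KV cache,
    # so the decode kernel would not need to apply RoPE again.
    #   paged_kpe_cache: [num_slots, pe_dim] flattened view of the paged cache
    #   slot_indices:    [num_tokens] destination slot for each new token
    #   k_pe:            [num_tokens, pe_dim] un-rotated positional keys
    #   positions:       [num_tokens] absolute position of each new token
    k_pe_rotated = apply_rope_fn(k_pe, positions)
    paged_kpe_cache[slot_indices] = k_pe_rotated
    # The decode kernel would then treat q_pe / k_pe as already rotated,
    # i.e. run with RoPE disabled, which is what I am asking for here.
    return paged_kpe_cache
```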

Is there any misunderstanding in my description above? If not, is there an easy way to disable RoPE in the current kernel?

Thanks a lot.

yzh119 (Collaborator) commented Jan 10, 2025

Thanks for your suggestions. It would be easy to remove the pe (RoPE) step from the implementation; I will do that later.
We are also working on a faster version of the MLA decoding kernels that uses tensor cores. Would you mind leaving some suggestions on the user interface?

lsw825 (Author) commented Jan 10, 2025

Sure. I'm very interested in the MLA kernel; you can contact me via the email in my profile if needed.

If I have any other suggestions that differ from the current interface, I'll also propose them via GitHub issues.

Thx :)
