Operators support for updated speculative inference design #8
Comments
For the second question, it seems …
I believe there are different CUDA kernels for prefilling and decoding tokens already implemented by @xinhaoc, but those kernels can still live in the same operator. As discussed with @jiazhihao, we want to use those kernels because they are optimized for their use cases.
For speculative decoding, currently we have …
Yes, in CUDA we have different kernels within one `multihead_attention` operator. What I mean is: should we split them into two operators, as the higher-level design does? (Now I think there is no need :P)
For tree verification, do we have a similar method to determine whether a request is in the prompt phase?
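One possible answer to the question above, sketched under assumptions: a request could be treated as still being in its prompt phase while no tokens have been committed to the KV cache yet. The struct and field names here (`VerifyRequestMeta`, `num_committed_tokens`, `in_prompt_phase`) are illustrative placeholders, not FlexFlow's actual API.

```cpp
#include <cassert>

// Hypothetical heuristic for tree verification: a request is in the prompt
// phase until at least one token has been verified and committed to the
// KV cache. Names are illustrative, not the project's real structs.
struct VerifyRequestMeta {
  int num_committed_tokens;  // tokens already verified and cached
};

static bool in_prompt_phase(const VerifyRequestMeta& r) {
  return r.num_committed_tokens == 0;
}
```

If the updated `TreeSearchBC` keeps any per-request count of committed/cached tokens, the same check would carry over without needing a dedicated phase flag.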
Related issues
#9 #14 #13
Description
We proposed an inference implementation refactoring that mainly involves the Pipeline Split and Struct Simplification, and this raises some issues to discuss in the operator (kernel) changes. I list them here; if I missed something, please feel free to correct me~

1. Splitting the prefilling and decoding stages
Previously we mixed the `prompt`-phase and `generation`-phase calculations in one inference kernel (`spec_inc_multihead_self_attention` or `tree_inc_multihead_self_attention`). To support the split stages, we should also split this mixed calculation.

But here's a problem: should we provide prompt and generation as two distinct inference kernel ops, or keep a single op that branches internally for the different stage calculations? The former approach would force a change in the operator DAG, so I don't think it is a good fit.
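The single-op option can be sketched as follows. This is a host-side illustration only, under the assumption that each request's phase can be inferred from how many tokens it contributes to the batch; `Phase`, `RequestMeta`, `classify`, and `run_attention` are hypothetical names, not FlexFlow's real API, and the kernel launches are elided as comments.

```cpp
#include <cassert>
#include <vector>

// Sketch: one attention operator that dispatches per request to the
// prefill- or decode-optimized kernel, leaving the operator DAG unchanged.
enum class Phase { Prefill, Decode };

struct RequestMeta {
  int tokens_in_batch;  // >1 during the prompt phase, ==1 when decoding
};

static Phase classify(const RequestMeta& r) {
  return r.tokens_in_batch > 1 ? Phase::Prefill : Phase::Decode;
}

// Single operator entry point; returns the number of prefill dispatches
// so the branching is observable in this host-side sketch.
static int run_attention(const std::vector<RequestMeta>& batch) {
  int prefill_calls = 0;
  for (const auto& r : batch) {
    if (classify(r) == Phase::Prefill) {
      // launch the prefill-optimized kernel here
      ++prefill_calls;
    } else {
      // launch the single-token decode kernel here
    }
  }
  return prefill_calls;
}
```

The design trade-off is that the branch stays an implementation detail of one operator, so both stage-specialized kernels remain usable without touching the graph.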
2. Simplifying the `BatchConfig` structure

The trivial changes are adopted, but I haven't fully figured out how we switch from `BeamSearchBC` to `TreeSearchBC`.

In the BeamSearch version, the last layer of the SSM is `beam_topk`, and its output is stored in `BeamInferenceResult` (using `download_tensor`). In the TreeSearch version, `SsmInferenceResult` is the same as `BeamInferenceResult`, so I guess we will still use `beam_topk`.

But `beam_topk` uses some fields, like `sub_requests` and `beamRequestsInfo::probs`, which were removed from the updated `TreeSearchBC`. Maybe we can discuss how to adapt it.
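To make the missing-field problem concrete, here is a host-side reference of the kind of computation `beam_topk` performs. This is NOT the real kernel: `parent_prob` stands in for the removed `beamRequestsInfo::probs` entry (the parent beam's cumulative score) and `beam_width` for the removed per-request `sub_requests` value, so it only shows which inputs an adaptation for `TreeSearchBC` would need to recover or recompute.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Reference sketch of beam_topk's per-beam work: score each candidate
// token by its cumulative probability, then keep the top `beam_width`.
// Returns (score, token_id) pairs sorted by descending score.
std::vector<std::pair<float, int>>
beam_topk_ref(const std::vector<float>& token_probs,
              float parent_prob,   // stand-in for beamRequestsInfo::probs
              int beam_width) {    // stand-in for the sub_requests entry
  std::vector<std::pair<float, int>> scored;
  for (int i = 0; i < static_cast<int>(token_probs.size()); ++i)
    scored.push_back({parent_prob * token_probs[i], i});  // cumulative score
  std::sort(scored.begin(), scored.end(),
            [](const auto& a, const auto& b) { return a.first > b.first; });
  scored.resize(std::min<std::size_t>(beam_width, scored.size()));
  return scored;
}
```

If `TreeSearchBC` can expose equivalent data (e.g. cumulative scores derived from the token tree and a fixed expansion width), `beam_topk` could be adapted rather than replaced; otherwise a new top-k layer over the tree metadata would be needed.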