Make use of structured output with vLLM #481
Comments
Nice timing with this, as I've recently been involved in some discussions here. We have some upcoming SDG features that definitely need structured output / guided decoding, although I don't know how much we've explored all the options. I've heard vLLM guided decoding come up, as well as instructor-ai, and I think there's some exploration to do to see which would work best for which use-case. Do you have any high-level guidance about how vLLM's structured output support compares with something like instructor-ai, and when to use which?
tl;dr
I have only looked briefly at instructor-ai so far, but the fundamental difference seems to be that instructor-ai is a client-side library. It's closer to what the SDG library does manually today for output validation, but more generic and reusable. You'll notice that their feature overview includes things like retry management, since they have no control over what the model responds with.

The vLLM approach is a server-side feature that is deeply integrated into token generation. After a forward pass through the model, we have a set of probabilities for the next token, and sampling is where the next token is chosen from those probabilities. All tokens that would result in invalid output are masked out at that sampling stage, which gives a much stronger guarantee that the output will match your requirements. The server-side approach also gives you the opportunity to benefit from faster inference (jump decoding) once we have that implemented.

One important note for consideration: the client-side approach will work everywhere, including when using llama-cpp for inference. Server-side structured output will require vLLM.
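To make the masking idea concrete, here is a minimal, purely illustrative sketch of logit masking at the sampling stage. It uses a toy vocabulary of whole words (real tokenizers split into subwords) and is not vLLM's actual implementation:

```python
# Conceptual sketch of guided decoding via logit masking (NOT vLLM's real code).
# Assumes a toy vocabulary and a predicate that says which tokens keep the
# partially generated output valid.
import math
import random

VOCAB = ["yes", "no", "maybe", "{", "}", "score"]

def allowed(prefix: str, token: str) -> bool:
    """Toy constraint: the final output must be exactly 'yes' or 'no'."""
    return (prefix + token) in ("yes", "no")

def sample_next(prefix: str, logits: list[float]) -> str:
    # Mask out every token that would make the output invalid by sending its
    # logit to -inf, then sample from the remaining probability mass.
    masked = [
        logit if allowed(prefix, tok) else -math.inf
        for tok, logit in zip(VOCAB, logits)
    ]
    weights = [math.exp(l) for l in masked]
    return random.choices(VOCAB, weights=weights, k=1)[0]

# Whatever the model "wants" to say, only "yes" or "no" can be sampled.
print(sample_next("", [0.1, 0.2, 5.0, 1.0, 1.0, 3.0]))
```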
I see the value in adding this. I feel like we should probably test out both approaches and measure the quality of the generated data and how well models can be trained with it.
Thanks for the detailed explanation, @russellb! I agree this is definitely something worth exploring, and the client-side vs. server-side explanation made the utility of each much clearer in my head. I believe our first upcoming use-case that will need some structured output exploration is data annotation, where we take a set of input data and run it through an annotation pipeline in which we expect the model to respond with one of a number of fixed annotation categories for each piece of input data. But there's also plenty of opportunity to explore some of this in our existing pipelines as well, as in, for example, our …
Sounds good - please feel free to ping me whenever you get around to trying things in vLLM, as I expect to keep working on it to some degree.
As of vLLM v0.6.5, there is a new default backend (xgrammar) for this feature that performs well enough that I think SDG should consider making use of it.
As a tl;dr of what this does: it allows you to specify output formatting requirements as part of your request. SDG currently includes a good bit of complexity for expressing output format expectations to the model, validating output against those expectations, and throwing away responses that fail to meet formatting requirements.
As a simple example, consider asking a model to produce a plain "yes" or "no" answer. You could specify that constraint as part of your request, guaranteeing that you will get back only "yes" or "no" and nothing else (see the sketch below).
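Here's a sketch of what that could look like against vLLM's OpenAI-compatible server, using its `guided_choice` extension to the request body. The server URL and model name are placeholder assumptions:

```python
# Sketch: constrain the response to exactly "yes" or "no" using vLLM's
# guided decoding extension to the OpenAI-compatible API.
# Assumes a vLLM server running locally; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model name
    messages=[{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
    extra_body={"guided_choice": ["yes", "no"]},  # vLLM-specific extension
)
print(resp.choices[0].message.content)  # will be "yes" or "no"
```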
As one example, take the following config: https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/configs/skills/evaluate_grounded_pair.yaml
By using this feature, a request to the model could include a JSON schema that says the output must be JSON formatted and include five fields: context, question, answers, evaluation, and score. Further, it can require that "score" is an integer. vLLM + xgrammar will guide the model to ensure that the output meets these requirements; a sketch of such a request follows below.
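The schema below is a hypothetical rendering of that config's expected fields (not SDG's actual code), passed through vLLM's `guided_json` extension to the OpenAI-compatible API; server URL, model name, and prompt are illustrative assumptions:

```python
# Sketch: ask vLLM + xgrammar to enforce a JSON object with the five fields
# from the evaluate_grounded_pair config, with "score" constrained to an integer.
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {
        "context": {"type": "string"},
        "question": {"type": "string"},
        "answers": {"type": "string"},
        "evaluation": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["context", "question", "answers", "evaluation", "score"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model name
    messages=[{"role": "user", "content": "Evaluate the answer for the given context."}],
    extra_body={"guided_json": schema},  # vLLM-specific extension
)
print(resp.choices[0].message.content)  # valid JSON matching the schema
```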
In the near term, I believe this should increase the reliability of getting usable responses from the model.
In the longer term, this will actually increase inference performance as we (vLLM) implement a technique called "jump decoding." Using this technique, we can identify cases where we know exactly what the next token(s) must be and don't need the model to figure that out. Using the simple "yes" or "no" example, if the model produced "y" as the first token, we know the rest must be "es" and don't need to do expensive model processing to generate that. By using this feature, you will get these performance boosts automatically if SDG expresses its requirements to vLLM.
For some more detailed information on what this feature does, here's a recent blog post: