
Make use of structured output with vLLM #481

Open

russellb opened this issue Jan 16, 2025 · 5 comments
Labels: enhancement (New feature or request)

@russellb
Member

As of vLLM v0.6.5, there is a new default backend (xgrammar) for structured output that performs well enough that I think SDG should consider making use of it.

As a tl;dr of what this does, it allows you to specify output formatting requirements as part of your request. SDG includes a good bit of complexity dealing with expressing output format expectations to the model, validating output to match those expectations, and throwing away responses that fail to meet formatting requirements.

As a simple example, consider a case where you're asking a model to produce a plain "yes" or "no" answer. You could specify this constraint as part of your request, guaranteeing that you get back only "yes" or "no" and nothing else.
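With vLLM's OpenAI-compatible server, that constraint can be expressed directly on the request. Here is a minimal sketch, assuming a local vLLM server; the base URL, model name, and prompt are placeholders, and guided_choice is vLLM's request extension for "pick exactly one of these strings":

```python
# Minimal sketch: constrain the response to exactly "yes" or "no" via vLLM's
# OpenAI-compatible server. The base_url, model name, and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Is the answer grounded in the context? Reply yes or no."}],
    # vLLM-specific extension: only these strings can be generated.
    extra_body={"guided_choice": ["yes", "no"]},
)
print(completion.choices[0].message.content)  # "yes" or "no", nothing else
```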

As one example, take the following config: https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/configs/skills/evaluate_grounded_pair.yaml

By using this feature, a request to the model could include a JSON schema that says the output must be JSON formatted and include five fields: context, question, answers, evaluation, and score. Further, it can require that "score" is an integer. vLLM + xgrammar will guide the model to ensure that the output meets these requirements.
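As a rough sketch of what such a request could look like (field names taken from the config above; the endpoint, model name, and prompt are placeholders), the schema is passed through vLLM's guided_json request extension:

```python
# Hedged sketch: attach a JSON schema so the response must be valid JSON with the
# five expected fields, and "score" must be an integer.
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {
        "context": {"type": "string"},
        "question": {"type": "string"},
        "answers": {"type": "string"},
        "evaluation": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["context", "question", "answers", "evaluation", "score"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Evaluate the grounded question/answer pair ..."}],
    extra_body={"guided_json": schema},  # vLLM-specific extension
)
```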

In the near term, I believe this should increase reliability of getting usable responses from the model.

In the longer term, this will actually increase inference performance as we (vLLM) implement a technique called "jump decoding." Using this technique, we can identify cases where we know exactly what the next token(s) must be and don't need the model to figure that out. Using the simple "yes" or "no" example, if the model produced "y" as the first token, we know the rest must be "es" and don't need to do expensive model processing to generate that. By using this feature, you will get these performance boosts automatically if SDG expresses its requirements to vLLM.

For some more detailed information on what this feature does, here's a recent blog post:

@bbrowning
Contributor

Nice timing with this, as I've recently been involved in some discussions here. We have some upcoming SDG features that definitely need structured output / guided decoding, although I don't know how much we've explored all the options. I've heard vLLM guided decoding come up as well as instructor-ai, and I think there's some exploration to see which would work best for which use-case. Do you have any high level guidance about how vLLM's structured output support compares with something like instructor-ai and when to use which?

@russellb
Member Author

russellb commented Jan 16, 2025

tl;dr

  • Your results are going to be more reliable and more performant using vLLM's structured output (a server-side approach deeply integrated into running a model)
  • This server-side approach will not work with llama-cpp. A client-side approach like instructor-ai where you ask nicely, hope, and then do your own parsing and validation will work elsewhere, including llama-cpp.

> Do you have any high level guidance about how vLLM's structured output support compares with something like instructor-ai and when to use which?

I have only briefly looked at instructor-ai right now, but it seems the fundamental difference is that instructor-ai works as a client-side library. It seems closer to what the SDG library is doing manually right now for output validation, but more generic and reusable. You'll notice in their overview of features they include things like retry management since they don't have control over what the model responds with.
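To make the contrast concrete, here is a rough illustration of the client-side style with instructor-ai; the model name, endpoint, and prompt are placeholders, and the exact API should be checked against instructor's docs:

```python
# Client-side approach (sketch): describe the expected shape with a Pydantic model;
# instructor parses and validates the response on the client and retries on failure.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Evaluation(BaseModel):
    evaluation: str
    score: int

client = instructor.from_openai(
    OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY"),
    mode=instructor.Mode.JSON,  # JSON mode tends to be safer for non-OpenAI servers
)

result = client.chat.completions.create(
    model="my-model",
    response_model=Evaluation,  # parsed and validated client-side
    max_retries=2,              # re-ask the model if validation fails
    messages=[{"role": "user", "content": "Evaluate the answer and give an integer score."}],
)
```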

The vLLM approach is a server-side feature that is deeply integrated into token generation. After running a pass through the model, we have a set of probabilities for the next token, and sampling is where the next token is chosen from those probabilities. What we do is mask out, at the sampling stage, every token that would result in invalid output. This provides a much stronger guarantee that output will match your requirements.
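Conceptually (this is not vLLM's actual code, just an illustration of the idea), a grammar or state machine decides which token ids are still valid at each step, and every other token's logit is pushed to negative infinity before the next token is sampled:

```python
# Conceptual illustration of sampling-stage masking; not vLLM's implementation.
import torch

def mask_invalid_tokens(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    """Return logits where every token not in allowed_token_ids is set to -inf."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# Example: a toy vocabulary of 10 tokens where only ids 2 and 7 keep the output valid.
logits = torch.randn(10)
masked = mask_invalid_tokens(logits, allowed_token_ids=[2, 7])
next_token = torch.argmax(masked).item()  # always 2 or 7, never an invalid token
```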

Finally, the server-side approach gives you the opportunity to benefit from faster inference (jump decoding) once we have that implemented.

One important note for consideration: the client-side approach will work everywhere, including when using llama-cpp for inference. Server-side structured output will require vLLM.

@RobotSail
Member

I see the value in adding this. I feel like we should test out both approaches and measure the quality of the generated data and how well models can be trained with it.

@bbrowning
Contributor

Thanks for the detailed explanation, @russellb! I agree this is definitely worth exploring, and the client-side vs. server-side explanation made things much clearer in my head when thinking about the utility of each. I believe our first upcoming use-case that will need some structured output exploration is data annotation, where we run a set of input data through an annotation pipeline and expect the model to respond with one of a fixed set of annotation categories for each piece of input data.

But there's also plenty of opportunity to explore some of this in our existing pipelines as well; for example, in our full pipeline the LLMBlock calls all expect some degree of structure when parsing the output. It could be that what works for cases like annotation, where we expect single-word outputs that exactly match one item from a predefined list, differs from what works for other cases where we look for specific start and end tags but the output is otherwise open-ended in its actual content. Definitely something to experiment with and see what works best where.
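As a purely hypothetical sketch of how those two cases might map onto vLLM's guided decoding parameters (the category names, tags, prompts, and endpoint below are all made up):

```python
# Hypothetical mapping of the two cases above onto vLLM request extensions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Annotation: the response must be exactly one of a fixed set of categories.
annotation = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Annotate this sample: ..."}],
    extra_body={"guided_choice": ["category_a", "category_b", "category_c"]},
)

# Tagged-but-open-ended output: constrain only the start/end framing with a regex.
tagged = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Rewrite the passage: ..."}],
    extra_body={"guided_regex": r"\[START\][\s\S]+\[END\]"},
)
```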

@russellb
Member Author

Sounds good - please feel free to ping me whenever you get around to trying things in vllm as I expect to keep working on it to some degree.

@bbrowning added the enhancement (New feature or request) label on Jan 21, 2025