
Add vLLM inference provider for OpenAI compatible vLLM server #178

Merged: 10 commits merged into meta-llama:main from the provider-vllm branch on Oct 21, 2024

Conversation

@terrytangyuan (Contributor) commented Oct 3, 2024

This PR adds a vLLM inference provider for the OpenAI-compatible vLLM server.
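
For context, a vLLM server running in OpenAI-compatible mode can be queried with the standard OpenAI Python client; that is the API surface this provider adapts to. The sketch below is illustrative only; the base URL, API key, and model name are placeholders rather than values from this PR:

```
# Illustrative sketch only (not code from this PR): querying a vLLM server that
# exposes the OpenAI-compatible API. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```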

@facebook-github-bot commented:

Hi @terrytangyuan!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@russellb (Contributor) commented Oct 3, 2024

@terrytangyuan Is there anything vLLM-specific in the implementation, or is it more of a generic adapter to an OpenAI-compatible API?

@terrytangyuan (Contributor, Author) commented:

@russellb Thanks for your question! One reason for opening this early draft PR is to get opinions on whether to use the OpenAI-compatible server. Does the meta-llama team have any preference? Currently, it's pretty generic, but there will likely be vLLM-specific handling around chat completion requests, engine configuration, model name mapping, etc.

@russellb (Contributor) commented Oct 3, 2024

> @russellb Thanks for your question! One reason for opening this early draft PR is to get opinions on whether to use the OpenAI-compatible server. Does the meta-llama team have any preference? Currently, it's pretty generic, but there will likely be vLLM-specific handling around chat completion requests, engine configuration, model name mapping, etc.

Just to clarify, I'm not a meta-llama team member. :)

But I'm looking at another approach for vLLM: direct integration using the vLLM Python APIs. Both seem useful to me, though I was hoping we could get away with a more generic OpenAI API adapter.

What do you mean by engine configuration? Are you thinking there would be some code also responsible for configuring and launching vLLM locally?

@terrytangyuan (Contributor, Author) commented Oct 3, 2024

> Just to clarify, I'm not a meta-llama team member. :)

Yep, I meant to tag the llama-stack team here: @ashwinb @yanxi0830

> What do you mean by engine configuration? Are you thinking there would be some code also responsible for configuring and launching vLLM locally?

Yes, here's the reference: https://docs.vllm.ai/en/stable/models/engine_args.html
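
For example, those engine arguments surface as constructor parameters when vLLM is driven directly from its Python API; a minimal sketch, where the model name and values are placeholders rather than settings proposed in this PR:

```
# Illustrative only: configuring vLLM engine arguments through the Python API.
# The model name and values below are placeholders, not settings from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",   # Hugging Face repo to load
    tensor_parallel_size=1,            # number of GPUs to shard across
    max_model_len=4096,                # maximum context length
    gpu_memory_utilization=0.9,        # fraction of GPU memory to reserve
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```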

@russellb (Contributor) commented Oct 3, 2024

Here's the WIP / draft of the alternative vllm approach that I was looking at: #181

@facebook-github-bot commented:

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@terrytangyuan changed the title from "WIP: Add vLLM provider" to "WIP: Add vLLM inference provider for OpenAI compatible vLLM server." on Oct 8, 2024
@terrytangyuan changed the title from "WIP: Add vLLM inference provider for OpenAI compatible vLLM server." to "Add vLLM inference provider for OpenAI compatible vLLM server." on Oct 11, 2024
@terrytangyuan changed the title from "Add vLLM inference provider for OpenAI compatible vLLM server." to "Add vLLM inference provider for OpenAI compatible vLLM server" on Oct 11, 2024
@terrytangyuan marked this pull request as ready for review on October 11, 2024
@terrytangyuan (Contributor, Author) commented:

Updated the PR to be consistent with the approach after the refactor in #201.

@ashwinb (Contributor) left a review comment:

A comment about list_models() and a nit about async generators, but otherwise this looks good.

@terrytangyuan (Contributor, Author) commented:

@ashwinb Would you mind taking another look at this? Thanks!

Review thread on the diff:

```
from .config import VLLMImplConfig

VLLM_SUPPORTED_MODELS = {
    "Llama3.1-8B": "meta-llama/Llama-3.1-8B",
```

@ashwinb (Contributor) commented Oct 15, 2024:

Is this mapping just:

```
{
    model.descriptor(): model.huggingface_repo
    for model in llama_models.sku_list.all_registered_models()
}
```

@terrytangyuan (Contributor, Author) replied:

I haven't checked how close they are but I borrowed the list used here: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/impls/vllm/vllm.py#L67

So maybe we can refactor these together later.

@ashwinb (Contributor) commented Oct 15, 2024

I think attaching a test would also be good here. Please see the instructions in providers/tests/inference/test_inference.py and see if that works for you?

@terrytangyuan (Contributor, Author) commented Oct 16, 2024

> I think attaching a test would also be good here. Please see the instructions in providers/tests/inference/test_inference.py and see if that works for you?

That's a great idea. However, I'll be traveling for a conference and probably won't have time to do this soon. Would it be okay to do that in a follow-up PR that adds vLLM to the supported implementations list? We'd like to avoid further conflict resolution after additional rounds of refactoring.

@ashwinb (Contributor) left a review comment:

@terrytangyuan I really, really need to see some test plan for this, or else I'm afraid we cannot merge it. It's OK if it's not an automated test, but this PR must show how you tested this yourself and found that it worked.

Review thread on the diff:

```
    pass

    async def list_models(self) -> List[ModelDef]:
        return [
```

@ashwinb (Contributor) commented:

For each model you retrieve from the client, you need to check whether it is one of the values of the map you defined above; otherwise, you are listing an erroneous (unresolvable) model and later things will fail.
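
Something roughly like the following sketch, building on the VLLM_SUPPORTED_MODELS snippet above; the client.models.list() call and the ModelDef fields are illustrative assumptions, not the final code:

```
# Sketch only: list the models reported by the OpenAI-compatible client, keeping
# just those that resolve back to a known descriptor. client.models.list() and the
# ModelDef(identifier=..., llama_model=...) fields are assumptions for illustration.
async def list_models(self) -> List[ModelDef]:
    repo_to_descriptor = {repo: name for name, repo in VLLM_SUPPORTED_MODELS.items()}
    models = []
    for model in self.client.models.list():
        if model.id in repo_to_descriptor:  # skip unresolvable models
            descriptor = repo_to_descriptor[model.id]
            models.append(ModelDef(identifier=descriptor, llama_model=descriptor))
    return models
```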

@ashwinb (Contributor) commented Oct 21, 2024

> Would it be okay to do that in a follow-up PR that adds vLLM to the supported implementations list? We'd like to avoid further conflict resolution after additional rounds of refactoring.

I missed this. OK, sure, let's do that; I will merge it.

@ashwinb (Contributor) left a review comment:

Let's get this in for now and work on testing, etc. in a follow-up PR

Review thread on the diff:

```
@@ -60,6 +60,15 @@ def available_providers() -> List[ProviderSpec]:
            module="llama_stack.providers.adapters.inference.ollama",
        ),
    ),
    remote_provider_spec(
```

@ashwinb (Contributor) commented:

I think I will comment this out for now
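
For reference, the commented-out registration might look roughly like the ollama entry above; the AdapterSpec field names below are an assumption based on that pattern rather than the exact code in this PR:

```
# Hypothetical sketch, mirroring the ollama entry above; field names are assumptions,
# and the entry is left commented out for now.
# remote_provider_spec(
#     api=Api.inference,
#     adapter=AdapterSpec(
#         adapter_type="vllm",
#         pip_packages=["openai"],
#         module="llama_stack.providers.adapters.inference.vllm",
#         config_class="llama_stack.providers.adapters.inference.vllm.VLLMImplConfig",
#     ),
# ),
```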

@ashwinb merged commit a27a2cd into meta-llama:main on Oct 21, 2024 (3 of 4 checks passed).
yanxi0830 pushed a commit that referenced this pull request Oct 21, 2024
This PR adds vLLM inference provider for OpenAI compatible vLLM server.
yanxi0830 added a commit that referenced this pull request Oct 21, 2024
* docker compose ollama

* comment

* update compose file

* readme for distributions

* readme

* move distribution folders

* move distribution/templates to distributions/

* rename

* kill distribution/templates

* readme

* readme

* build/developer cookbook/new api provider

* developer cookbook

* readme

* readme

* [bugfix] fix case for agent when memory bank registered without specifying provider_id (#264)

* fix case where memory bank is registered without provider_id

* memory test

* agents unit test

* Add an option to not use elastic agents for meta-reference inference (#269)

* Allow overridding checkpoint_dir via config

* Small rename

* Make all methods `async def` again; add completion() for meta-reference (#270)

PR #201 made several changes while trying to fix issues with getting the stream=False branches of the inference and agents APIs working. As part of this, it made a change that was slightly gratuitous: making chat_completion() and its brethren "def" instead of "async def".

The rationale was that this allowed the user of this API (within llama-stack) to call it as:

```
async for chunk in api.chat_completion(params)
```

However, it caused unnecessary confusion for several folks. Given that clients (e.g., llama-stack-apps) use the SDK methods anyway (which are completely isolated), this choice was not ideal. Let's revert so the call now looks like:

```
async for chunk in await api.chat_completion(params)
```

Bonus: Added a completion() implementation for the meta-reference provider. Technically should have been another PR :)

* Improve an important error message

* update ollama for llama-guard3

* Add vLLM inference provider for OpenAI compatible vLLM server (#178)

This PR adds vLLM inference provider for OpenAI compatible vLLM server.

* Create .readthedocs.yaml

Trying out readthedocs

* Update event_logger.py (#275)

spelling error

* vllm

* build templates

* delete templates

* tmp add back build to avoid merge conflicts

* vllm

* vllm

---------

Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: raghotham <[email protected]>
Co-authored-by: nehal-a2z <[email protected]>
@terrytangyuan (Contributor, Author) commented:

Thank you!

@terrytangyuan deleted the provider-vllm branch on October 23, 2024 at 00:59.
Labels: CLA Signed