
Add vLLM inference provider for OpenAI compatible vLLM server #178

Merged: 10 commits merged into meta-llama:main from the provider-vllm branch on Oct 21, 2024

Conversation

@terrytangyuan (Contributor) commented Oct 3, 2024

This PR adds a vLLM inference provider for the OpenAI-compatible vLLM server.
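
For context, a vLLM server running in OpenAI-compatible mode can be queried with the standard OpenAI Python client; that is the API surface this provider adapts to. The sketch below is illustrative only; the base URL, API key, and model name are placeholders rather than values from this PR:

```
# Illustrative sketch only (not code from this PR): querying a vLLM server that
# exposes the OpenAI-compatible API. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```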

@facebook-github-bot commented:

Hi @terrytangyuan!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@russellb (Contributor) commented Oct 3, 2024

@terrytangyuan Is there anything vLLM-specific in the implementation, or is it more of a generic adapter to an OpenAI-compatible API?

@terrytangyuan (Contributor, Author) commented:

@russellb Thanks for your question! One reason for opening this early draft PR is to get opinions on whether to use the OpenAI-compatible server. Does the meta-llama team have any preference? Currently, it's pretty generic, but there will likely be vLLM-specific handling around chat completion requests, engine configuration, model name mapping, etc.

@russellb (Contributor) commented Oct 3, 2024

> @russellb Thanks for your question! One reason for opening this early draft PR is to get opinions on whether to use the OpenAI-compatible server. Does the meta-llama team have any preference? Currently, it's pretty generic, but there will likely be vLLM-specific handling around chat completion requests, engine configuration, model name mapping, etc.

Just to clarify, I'm not a meta-llama team member. :)

But I'm looking at another approach for vLLM: direct integration using the vLLM Python APIs. Both seem useful to me, though I was hoping we could get away with a more generic OpenAI API adapter.

What do you mean by engine configuration? Are you thinking there would be some code also responsible for configuring and launching vLLM locally?

@terrytangyuan (Contributor, Author) commented Oct 3, 2024

> Just to clarify, I'm not a meta-llama team member. :)

Yep, I meant to tag the llama-stack team here: @ashwinb @yanxi0830

> What do you mean by engine configuration? Are you thinking there would be some code also responsible for configuring and launching vLLM locally?

Yes, here's the reference: https://docs.vllm.ai/en/stable/models/engine_args.html
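
For example, those engine arguments surface as constructor parameters when vLLM is driven directly from its Python API; a minimal sketch, where the model name and values are placeholders rather than settings proposed in this PR:

```
# Illustrative only: configuring vLLM engine arguments through the Python API.
# The model name and values below are placeholders, not settings from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",   # Hugging Face repo to load
    tensor_parallel_size=1,            # number of GPUs to shard across
    max_model_len=4096,                # maximum context length
    gpu_memory_utilization=0.9,        # fraction of GPU memory to reserve
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```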

@russellb (Contributor) commented Oct 3, 2024

Here's the WIP / draft of the alternative vllm approach that I was looking at: #181

@facebook-github-bot commented:

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@terrytangyuan changed the title from "WIP: Add vLLM provider" to "WIP: Add vLLM inference provider for OpenAI compatible vLLM server." on Oct 8, 2024
@terrytangyuan changed the title from "WIP: Add vLLM inference provider for OpenAI compatible vLLM server." to "Add vLLM inference provider for OpenAI compatible vLLM server." on Oct 11, 2024
@terrytangyuan changed the title from "Add vLLM inference provider for OpenAI compatible vLLM server." to "Add vLLM inference provider for OpenAI compatible vLLM server" on Oct 11, 2024
@terrytangyuan marked this pull request as ready for review on October 11, 2024
@terrytangyuan (Contributor, Author) commented:

Updated the PR to be consistent with the approach after the refactor in #201.

@ashwinb (Contributor) left a review comment:

A comment about list_models() and a nit about async generators, but otherwise this looks good.

@terrytangyuan (Contributor, Author) commented:

@ashwinb Would you mind taking another look at this? Thanks!

Review thread on the diff:

```
from .config import VLLMImplConfig

VLLM_SUPPORTED_MODELS = {
    "Llama3.1-8B": "meta-llama/Llama-3.1-8B",
```

@ashwinb (Contributor) commented Oct 15, 2024:

Is this mapping just:

```
{
    model.descriptor(): model.huggingface_repo
    for model in llama_models.sku_list.all_registered_models()
}
```

@terrytangyuan (Contributor, Author) replied:

I haven't checked how close they are but I borrowed the list used here: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/impls/vllm/vllm.py#L67

So maybe we can refactor these together later.

@ashwinb (Contributor) commented Oct 15, 2024

I think attaching a test would also be good here. Please see the instructions in providers/tests/inference/test_inference.py and see if that works for you?

@terrytangyuan (Contributor, Author) commented Oct 16, 2024

> I think attaching a test would also be good here. Please see the instructions in providers/tests/inference/test_inference.py and see if that works for you?

That's a great idea. However, I'll be traveling for a conference and probably won't have time to do this soon. Would it be okay to do that in a follow-up PR that adds vLLM to the supported implementations list? We'd like to avoid further conflict resolution after additional rounds of refactoring.

@ashwinb (Contributor) left a review comment:

@terrytangyuan I really, really need to see some test plan for this, or else I'm afraid we cannot merge it. It's OK if it's not an automated test, but this PR must show how you tested this yourself and found that it worked.

Review thread on the diff:

```
    pass

    async def list_models(self) -> List[ModelDef]:
        return [
```

@ashwinb (Contributor) commented:

For each model you retrieve from the client, you need to check whether it is one of the values of the map you defined above; otherwise, you are listing an erroneous (unresolvable) model and later things will fail.
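
Something roughly like the following sketch, building on the VLLM_SUPPORTED_MODELS snippet above; the client.models.list() call and the ModelDef fields are illustrative assumptions, not the final code:

```
# Sketch only: list the models reported by the OpenAI-compatible client, keeping
# just those that resolve back to a known descriptor. client.models.list() and the
# ModelDef(identifier=..., llama_model=...) fields are assumptions for illustration.
async def list_models(self) -> List[ModelDef]:
    repo_to_descriptor = {repo: name for name, repo in VLLM_SUPPORTED_MODELS.items()}
    models = []
    for model in self.client.models.list():
        if model.id in repo_to_descriptor:  # skip unresolvable models
            descriptor = repo_to_descriptor[model.id]
            models.append(ModelDef(identifier=descriptor, llama_model=descriptor))
    return models
```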

@ashwinb (Contributor) commented Oct 21, 2024

> Would it be okay to do that in a follow-up PR that adds vLLM to the supported implementations list? We'd like to avoid further conflict resolution after additional rounds of refactoring.

I missed this. OK, sure, let's do that; I will merge it.

@ashwinb (Contributor) left a review comment:

Let's get this in for now and work on testing, etc. in a follow-up PR

Review thread on the diff:

```
@@ -60,6 +60,15 @@ def available_providers() -> List[ProviderSpec]:
            module="llama_stack.providers.adapters.inference.ollama",
        ),
    ),
    remote_provider_spec(
```

@ashwinb (Contributor) commented:

I think I will comment this out for now
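
For reference, the commented-out registration might look roughly like the ollama entry above; the AdapterSpec field names below are an assumption based on that pattern rather than the exact code in this PR:

```
# Hypothetical sketch, mirroring the ollama entry above; field names are assumptions,
# and the entry is left commented out for now.
# remote_provider_spec(
#     api=Api.inference,
#     adapter=AdapterSpec(
#         adapter_type="vllm",
#         pip_packages=["openai"],
#         module="llama_stack.providers.adapters.inference.vllm",
#         config_class="llama_stack.providers.adapters.inference.vllm.VLLMImplConfig",
#     ),
# ),
```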

@ashwinb merged commit a27a2cd into meta-llama:main on Oct 21, 2024 (3 of 4 checks passed).
yanxi0830 pushed a commit that referenced this pull request Oct 21, 2024
This PR adds vLLM inference provider for OpenAI compatible vLLM server.
yanxi0830 added a commit that referenced this pull request Oct 21, 2024
* docker compose ollama

* comment

* update compose file

* readme for distributions

* readme

* move distribution folders

* move distribution/templates to distributions/

* rename

* kill distribution/templates

* readme

* readme

* build/developer cookbook/new api provider

* developer cookbook

* readme

* readme

* [bugfix] fix case for agent when memory bank registered without specifying provider_id (#264)

* fix case where memory bank is registered without provider_id

* memory test

* agents unit test

* Add an option to not use elastic agents for meta-reference inference (#269)

* Allow overridding checkpoint_dir via config

* Small rename

* Make all methods `async def` again; add completion() for meta-reference (#270)

PR #201 made several changes while trying to fix issues with getting the stream=False branches of the inference and agents APIs working. As part of this, it made a change that was slightly gratuitous: making chat_completion() and its brethren "def" instead of "async def".

The rationale was that this allowed the user of this API (within llama-stack) to call it as:

```
async for chunk in api.chat_completion(params)
```

However, it caused unnecessary confusion for several folks. Given that clients (e.g., llama-stack-apps) use the SDK methods anyway (which are completely isolated), this choice was not ideal. Let's revert so the call now looks like:

```
async for chunk in await api.chat_completion(params)
```

Bonus: Added a completion() implementation for the meta-reference provider. Technically should have been another PR :)

* Improve an important error message

* update ollama for llama-guard3

* Add vLLM inference provider for OpenAI compatible vLLM server (#178)

This PR adds vLLM inference provider for OpenAI compatible vLLM server.

* Create .readthedocs.yaml

Trying out readthedocs

* Update event_logger.py (#275)

spelling error

* vllm

* build templates

* delete templates

* tmp add back build to avoid merge conflicts

* vllm

* vllm

---------

Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: raghotham <[email protected]>
Co-authored-by: nehal-a2z <[email protected]>
@terrytangyuan (Contributor, Author) commented:

Thank you!

@terrytangyuan deleted the provider-vllm branch on October 23, 2024 at 00:59.
Labels: CLA Signed