Inline vLLM inference provider #181

Merged: 4 commits into meta-llama:main on Oct 6, 2024

Conversation

@russellb (Contributor) commented Oct 3, 2024

This seems to be working in a very basic state, at least.

  • Using the provided inference test client, I've tested chat completion with
    and without streaming.

  • I have only tested with Llama3.1-8B-Instruct and Llama3.2-1B-Instruct.

  • Multi-GPU support is working in conda, but not in Docker at the moment (more work
    is needed in the Docker image; additional dependencies are necessary).

  • Tool support probably doesn't work since I haven't tested it at all.

a08fd8f Add boilerplate for vllm inference provider
31a0c51 Add vllm to the inference registry
08da5d0 Add a local-vllm template
5626e79 Implement (chat_)completion for vllm provider

commit a08fd8f
Author: Russell Bryant [email protected]
Date: Sat Sep 28 18:46:35 2024 +0000

Add boilerplate for vllm inference provider

Signed-off-by: Russell Bryant <[email protected]>

commit 31a0c51
Author: Russell Bryant [email protected]
Date: Sat Sep 28 19:06:53 2024 +0000

Add vllm to the inference registry

Signed-off-by: Russell Bryant <[email protected]>

commit 08da5d0
Author: Russell Bryant [email protected]
Date: Sat Sep 28 19:10:04 2024 +0000

Add a local-vllm template

This is just like `local`, which uses `meta-reference` for everything, except
it uses `vllm` for inference.

Docker works, but so far `conda` is a bit easier to use with the vllm
provider. The default container base image does not include all the
libraries needed for all vllm features; more CUDA dependencies are
necessary.

I started changing the base image used in this template, but it also
required changes to the Dockerfile, so it was getting too involved to
include in the first PR.

Signed-off-by: Russell Bryant <[email protected]>

commit 5626e79
Author: Russell Bryant [email protected]
Date: Tue Oct 1 13:12:11 2024 +0000

Implement (chat_)completion for vllm provider

This is the start of an inline inference provider using vllm as a
library.

Issue #142

Working so far:

* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True`
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False`

Example:

```
$ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False
User>hello world, write me a 2 sentence poem about the moon
Assistant>
The moon glows bright in the midnight sky
A beacon of light,
```

I have only tested these models:

* `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4)
* `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)

Signed-off-by: Russell Bryant <[email protected]>
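
For context on what "using vllm as a library" means above, here is a minimal sketch of the underlying vLLM library calls, separate from the llama_stack provider wiring in this PR. The model id and sampling values are illustrative, not taken from the PR.

```
# Minimal sketch of calling vLLM as a library (not the provider code from this PR).
from vllm import LLM, SamplingParams

# tensor_parallel_size controls how many GPUs the model is sharded across:
# the tests above used 4 for Llama3.1-8B-Instruct and 1 for Llama3.2-1B-Instruct.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Write me a 2 sentence poem about the moon."], params)
print(outputs[0].outputs[0].text)
```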

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 3, 2024
@russellb force-pushed the vllm branch 6 times, most recently from 0834083 to 2444737 on October 4, 2024 18:44
@russellb changed the title from "WIP: Inline vllm inference provider" to "Inline vLLM inference provider" on Oct 4, 2024
@russellb marked this pull request as ready for review on October 4, 2024 18:46
```
@@ -63,7 +63,15 @@ async def completion(
         stream: Optional[bool] = False,
         logprobs: Optional[LogProbConfig] = None,
     ) -> AsyncGenerator:
-        raise NotImplementedError()
+        messages = [Message(role="user", content=content)]
```
Contributor

this is a good but unrelated change. Ok to move it out of this PR into a new PR?

Contributor Author

Ha! So funny story ... I just spent a few minutes wondering where my completion() code went, because I swore I remembered writing it. I did it in the wrong file! Ha. I just had it open as a reference at one point ...

Yes, I'll absolutely remove it and put it back in vllm.py!

Contributor Author

removed ... (oops!)
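
As an aside for readers, the pattern in the hunk above is to implement `completion()` by wrapping the raw content in a single user message and delegating to the chat-completion path. A rough sketch of that shape follows; the class, type, and parameter names are stand-ins, not the actual llama_stack or vLLM provider interfaces.

```
# Rough sketch of the delegation pattern discussed above. Message and the
# provider class here are stand-ins, not the real llama_stack types.
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class Message:
    role: str
    content: str


class ProviderSketch:
    async def chat_completion(self, messages, stream: bool = False) -> AsyncGenerator:
        # Placeholder: the real provider would run inference via vLLM here.
        yield f"(echo) {messages[0].content}"

    async def completion(self, content: str, stream: bool = False) -> AsyncGenerator:
        # Treat a plain completion request as a one-message chat.
        messages = [Message(role="user", content=content)]
        async for chunk in self.chat_completion(messages, stream=stream):
            yield chunk
```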

@ashwinb (Contributor) left a comment

This looks solid! Let's get this in.

@ashwinb merged commit f73e247 into meta-llama:main on Oct 6, 2024
3 checks passed
@ashwinb (Contributor) commented Oct 6, 2024

This is 🔥 btw.

@russellb (Contributor Author) commented Oct 6, 2024

> This is 🔥 btw.

@ashwinb Thank you! I'll file follow-up issues for the known problems and the things I know should be improved.

An idea: it may be helpful to have a vllm label I can apply to those issues. We could have similar labels for other providers.

Labels: CLA Signed (managed by the Meta Open Source bot)
4 participants