Inline vLLM inference provider #181

Merged: 4 commits into meta-llama:main on Oct 6, 2024

Conversation

@russellb (Contributor) commented Oct 3, 2024

This seems to be working in a very basic state, at least.

  • Using the provided inference test client, I've tested chat completion with
    and without streaming.

  • I have only tested with Llama3.1-8B-Instruct and Llama3.2-1B-Instruct.

  • Multi-GPU support is working in conda, but not in Docker at the moment (more work
    is needed in the Docker image; additional dependencies are necessary).

  • Tool support probably doesn't work since I haven't tested it at all.

a08fd8f Add boilerplate for vllm inference provider
31a0c51 Add vllm to the inference registry
08da5d0 Add a local-vllm template
5626e79 Implement (chat_)completion for vllm provider

commit a08fd8f
Author: Russell Bryant [email protected]
Date: Sat Sep 28 18:46:35 2024 +0000

Add boilerplate for vllm inference provider

Signed-off-by: Russell Bryant <[email protected]>

commit 31a0c51
Author: Russell Bryant [email protected]
Date: Sat Sep 28 19:06:53 2024 +0000

Add vllm to the inference registry

Signed-off-by: Russell Bryant <[email protected]>

commit 08da5d0
Author: Russell Bryant [email protected]
Date: Sat Sep 28 19:10:04 2024 +0000

Add a local-vllm template

This is just like `local`, which uses `meta-reference` for everything, except
it uses `vllm` for inference.

Docker works, but so far `conda` is a bit easier to use with the vllm
provider. The default container base image does not include all the
libraries needed for all vllm features; more CUDA dependencies are
necessary.

I started changing the base image used in this template, but it also
required changes to the Dockerfile, so it was getting too involved to
include in the first PR.

Signed-off-by: Russell Bryant <[email protected]>

commit 5626e79
Author: Russell Bryant [email protected]
Date: Tue Oct 1 13:12:11 2024 +0000

Implement (chat_)completion for vllm provider

This is the start of an inline inference provider using vllm as a
library.

Issue #142

Working so far:

* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True`
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False`

Example:

```
$ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False
User>hello world, write me a 2 sentence poem about the moon
Assistant>
The moon glows bright in the midnight sky
A beacon of light,
```

I have only tested these models:

* `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4)
* `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)

Signed-off-by: Russell Bryant <[email protected]>
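
For context on what "using vllm as a library" means above, here is a minimal sketch of the underlying vLLM library calls, separate from the llama_stack provider wiring in this PR. The model id and sampling values are illustrative, not taken from the PR.

```
# Minimal sketch of calling vLLM as a library (not the provider code from this PR).
from vllm import LLM, SamplingParams

# tensor_parallel_size controls how many GPUs the model is sharded across:
# the tests above used 4 for Llama3.1-8B-Instruct and 1 for Llama3.2-1B-Instruct.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Write me a 2 sentence poem about the moon."], params)
print(outputs[0].outputs[0].text)
```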

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 3, 2024
@russellb force-pushed the vllm branch 6 times, most recently from 0834083 to 2444737 on October 4, 2024 18:44
@russellb changed the title from "WIP: Inline vllm inference provider" to "Inline vLLM inference provider" on Oct 4, 2024
@russellb marked this pull request as ready for review on October 4, 2024 18:46
```
@@ -63,7 +63,15 @@ async def completion(
         stream: Optional[bool] = False,
         logprobs: Optional[LogProbConfig] = None,
     ) -> AsyncGenerator:
-        raise NotImplementedError()
+        messages = [Message(role="user", content=content)]
```
Contributor

this is a good but unrelated change. Ok to move it out of this PR into a new PR?

Contributor Author

Ha! So funny story ... I just spent a few minutes wondering where my completion() code went, because I swore I remembered writing it. I did it in the wrong file! Ha. I just had it open as a reference at one point ...

Yes, I'll absolutely remove it and put it back in vllm.py!

Contributor Author

removed ... (oops!)
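
As an aside for readers, the pattern in the hunk above is to implement `completion()` by wrapping the raw content in a single user message and delegating to the chat-completion path. A rough sketch of that shape follows; the class, type, and parameter names are stand-ins, not the actual llama_stack or vLLM provider interfaces.

```
# Rough sketch of the delegation pattern discussed above. Message and the
# provider class here are stand-ins, not the real llama_stack types.
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class Message:
    role: str
    content: str


class ProviderSketch:
    async def chat_completion(self, messages, stream: bool = False) -> AsyncGenerator:
        # Placeholder: the real provider would run inference via vLLM here.
        yield f"(echo) {messages[0].content}"

    async def completion(self, content: str, stream: bool = False) -> AsyncGenerator:
        # Treat a plain completion request as a one-message chat.
        messages = [Message(role="user", content=content)]
        async for chunk in self.chat_completion(messages, stream=stream):
            yield chunk
```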

@ashwinb (Contributor) left a comment

This looks solid! Let's get this in.

@ashwinb merged commit f73e247 into meta-llama:main on Oct 6, 2024
3 checks passed
@ashwinb (Contributor) commented Oct 6, 2024

This is 🔥 btw.

@russellb (Contributor Author) commented Oct 6, 2024

> This is 🔥 btw.

@ashwinb Thank you! I'll file follow-up issues for the known problems and the things I know should be improved.

An idea: it may be helpful to have a vllm label I can apply to those issues. We could have similar labels for other providers.

Labels: CLA Signed (managed by the Meta Open Source bot)
4 participants