Inline vLLM inference provider #181
Conversation
Force-pushed from 0834083 to 2444737
@@ -63,7 +63,15 @@ async def completion(
         stream: Optional[bool] = False,
         logprobs: Optional[LogProbConfig] = None,
     ) -> AsyncGenerator:
-        raise NotImplementedError()
+        messages = [Message(role="user", content=content)]
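For context, the hunk under discussion converts a plain `completion()` prompt into a single-turn chat request. Below is a minimal sketch of that delegation pattern; it assumes the surrounding module's imports (`Message`, `LogProbConfig`, `AsyncGenerator`) and a `chat_completion()` method on the same provider, neither of which is shown in the excerpt, so treat the names as illustrative rather than the PR's exact code.

```python
# Sketch of the delegation pattern implied by the hunk above. Assumes the
# surrounding module's imports and an existing chat_completion() method on
# the provider; both are assumptions, not shown in the excerpt.
async def completion(
    self,
    content: str,
    sampling_params=None,
    stream: bool = False,
    logprobs=None,
) -> AsyncGenerator:
    # Wrap the raw prompt as a single-turn user message ...
    messages = [Message(role="user", content=content)]
    # ... then reuse chat_completion() for generation, passing the stream and
    # logprobs options straight through and re-yielding its chunks.
    async for chunk in self.chat_completion(
        messages=messages,
        sampling_params=sampling_params,
        stream=stream,
        logprobs=logprobs,
    ):
        yield chunk
```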
this is a good but unrelated change. Ok to move it out of this PR into a new PR?
Ha! So funny story ... I just spent a few minutes wondering where my `completion()` code went, because I swore I remembered writing it. I did it in the wrong file! Ha. I just had it open as a reference at one point ... yes, I'll absolutely remove it and put it back in `vllm.py`!
removed ... (oops!)
This is just like `local` using `meta-reference` for everything, except it uses `vllm` for inference. Docker works, but so far `conda` is a bit easier to use with the vllm provider. The default container base image does not include all the necessary libraries for all vllm features; more CUDA dependencies are needed. I started changing the base image used in this template, but that also required changes to the Dockerfile, so it was getting too involved to include in the first PR.

Signed-off-by: Russell Bryant <[email protected]>
This is the start of an inline inference provider using vllm as a library. Issue meta-llama#142

Working so far:

* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True`
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False`

Example:

```
$ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False
User>hello world, write me a 2 sentence poem about the moon
Assistant> The moon glows bright in the midnight sky
A beacon of light,
```

I have only tested these models:

* `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4)
* `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)

Signed-off-by: Russell Bryant <[email protected]>
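As background on the "vllm as a library" approach described above, here is a minimal standalone sketch of running vLLM in-process and generating from it. This illustrates the generic vLLM engine API rather than the provider code added by this PR; the model name and sampling settings are placeholders.

```python
# Standalone sketch: drive vLLM as an in-process library (no separate server).
# Illustrates the general vLLM engine API, not this PR's provider code.
import asyncio
import uuid

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams


async def main() -> None:
    # tensor_parallel_size > 1 shards the model across that many GPUs.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-3.2-1B-Instruct", tensor_parallel_size=1)
    )

    params = SamplingParams(temperature=0.7, max_tokens=64)
    prompt = "Write me a 2 sentence poem about the moon."

    final = None
    # generate() yields partial RequestOutput objects as decoding progresses,
    # which is what makes streaming responses straightforward to implement.
    async for request_output in engine.generate(prompt, params, str(uuid.uuid4())):
        final = request_output

    print(final.outputs[0].text)


if __name__ == "__main__":
    asyncio.run(main())
```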
This looks solid! Let's get this in.
This is 🔥 btw.
@ashwinb Thank you! I'll file follow-up issues for the known problems and things I know should be improved. An idea - It may be helpful to have a
This seems to be working in a very basic state, at least.

Using the provided inference test client, I've tested chat completion with streaming and without streaming.

I have only tested with `Llama3.1-8B-Instruct` and `Llama3.2-1B-Instruct`.

Multi-GPU support is working in conda, but not in docker at the moment (more work is needed in the docker image: additional dependencies are necessary). A rough config sketch for the multi-GPU setting follows below.

Tool support probably doesn't work, since I haven't tested it at all.
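To make the multi-GPU point above concrete, a provider configuration along the following lines can surface vLLM's tensor parallelism. The class and field names here are illustrative assumptions, not necessarily the exact schema introduced by this PR.

```python
# Hypothetical provider config sketch; names are illustrative assumptions,
# not necessarily the exact schema introduced by this PR.
from pydantic import BaseModel, Field


class VLLMSketchConfig(BaseModel):
    model: str = Field(
        default="Llama3.2-1B-Instruct",
        description="Model to load in-process via vLLM",
    )
    tensor_parallel_size: int = Field(
        default=1,
        description="Number of GPUs to shard across (e.g. 4 for Llama3.1-8B-Instruct above)",
    )
    max_tokens: int = Field(
        default=4096,
        description="Maximum number of tokens to generate per request",
    )
```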
a08fd8f Add boilerplate for vllm inference provider
31a0c51 Add vllm to the inference registry
08da5d0 Add a local-vllm template
5626e79 Implement (chat_)completion for vllm provider
commit a08fd8f
Author: Russell Bryant <[email protected]>
Date:   Sat Sep 28 18:46:35 2024 +0000

    Add boilerplate for vllm inference provider

commit 31a0c51
Author: Russell Bryant <[email protected]>
Date:   Sat Sep 28 19:06:53 2024 +0000

    Add vllm to the inference registry

commit 08da5d0
Author: Russell Bryant <[email protected]>
Date:   Sat Sep 28 19:10:04 2024 +0000

    Add a local-vllm template

commit 5626e79
Author: Russell Bryant <[email protected]>
Date:   Tue Oct 1 13:12:11 2024 +0000

    Implement (chat_)completion for vllm provider