Replies: 2 comments
-
Hi @vadimcn - I'm no expert, but I came across this reply, which might help with your question :) It seems the answer may be to use pipeline parallelism as well.
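For context, a minimal sketch of what the "tensor parallelism plus pipeline parallelism" combination might look like, assuming a vLLM version whose offline `LLM` entry point accepts both `tensor_parallel_size` and `pipeline_parallel_size` (in some releases pipeline parallelism is only available through the online serving path); the model id is a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute the actual checkpoint.
llm = LLM(
    model="my-org/my-model",
    tensor_parallel_size=4,    # capped at 4 by the head/vocab divisibility rule
    pipeline_parallel_size=4,  # 4 TP x 4 PP = 16 GPUs in total
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The idea is that the tensor-parallel degree is capped at 4 by the divisibility constraints, so a pipeline degree of 4 makes up the remaining factor to cover all 16 GPUs; the equivalent server flags are `--tensor-parallel-size 4 --pipeline-parallel-size 4`.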
-
@nicklausbrown Yeah, tried it:
-
I have a 16-GPU machine available, and a model that fits on a single GPU. What is the best way to optimize vLLM performance in this situation?
I've tried tensor parallelism; however, it seems to require both the number of attention heads and the vocabulary size to be divisible by the number of GPUs, and in my case the GCD of these is 4 😞
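For concreteness, a quick way to check which tensor-parallel degrees a given checkpoint tolerates under that divisibility rule (a sketch only, assuming the model's Hugging Face config exposes the usual `num_attention_heads` and `vocab_size` fields; the model id is a placeholder):

```python
from transformers import AutoConfig

# Placeholder model id: substitute the checkpoint actually being served.
cfg = AutoConfig.from_pretrained("my-org/my-model")
heads, vocab = cfg.num_attention_heads, cfg.vocab_size

# Tensor-parallel degrees that divide both the head count and the vocab size.
usable = [tp for tp in (1, 2, 4, 8, 16) if heads % tp == 0 and vocab % tp == 0]
print(f"num_attention_heads={heads}, vocab_size={vocab}, usable TP sizes: {usable}")
```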