Replies: 2 comments
-
Hi @vadimcn - I'm no expert, but I came across this reply, which might help with your question :) It seems the answer may be to use pipeline parallelism as well.
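For context, a minimal sketch of what the "tensor parallelism plus pipeline parallelism" combination might look like, assuming a vLLM version whose offline `LLM` entry point accepts both `tensor_parallel_size` and `pipeline_parallel_size` (in some releases pipeline parallelism is only available through the online serving path); the model id is a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute the actual checkpoint.
llm = LLM(
    model="my-org/my-model",
    tensor_parallel_size=4,    # capped at 4 by the head/vocab divisibility rule
    pipeline_parallel_size=4,  # 4 TP x 4 PP = 16 GPUs in total
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The idea is that the tensor-parallel degree is capped at 4 by the divisibility constraints, so a pipeline degree of 4 makes up the remaining factor to cover all 16 GPUs; the equivalent server flags are `--tensor-parallel-size 4 --pipeline-parallel-size 4`.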
-
@nicklausbrown Yeah, tried it:
-
I have a 16-GPU machine available, and a model that fits on a single GPU. What is the best way to optimize vLLM performance in this situation?
I've tried tensor parallelism; however, it seems to require both the number of attention heads and the vocabulary size to be divisible by the number of GPUs, and in my case the GCD of these is 4 😞
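For concreteness, a quick way to check which tensor-parallel degrees a given checkpoint tolerates under that divisibility rule (a sketch only, assuming the model's Hugging Face config exposes the usual `num_attention_heads` and `vocab_size` fields; the model id is a placeholder):

```python
from transformers import AutoConfig

# Placeholder model id: substitute the checkpoint actually being served.
cfg = AutoConfig.from_pretrained("my-org/my-model")
heads, vocab = cfg.num_attention_heads, cfg.vocab_size

# Tensor-parallel degrees that divide both the head count and the vocab size.
usable = [tp for tp in (1, 2, 4, 8, 16) if heads % tp == 0 and vocab % tp == 0]
print(f"num_attention_heads={heads}, vocab_size={vocab}, usable TP sizes: {usable}")
```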