
Splitting LLM layers across multiple GPUs #1052

Closed
JOHW85 opened this issue Jan 22, 2023 · 6 comments
Labels: enhancement (New feature or request)

Comments

JOHW85 commented Jan 22, 2023

As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism to split a model's layers across multiple GPUs (or GPU+CPU) to meet the memory requirements for loading the model, as described here:
https://huggingface.co/docs/transformers/v4.15.0/parallelism

JOHW85 changed the title Splitting LLM layers across multiple GPUs → Splitting LLM layers across multiple GPUs (Model Parallelism) on Jan 22, 2023
guillaumekln added the enhancement (New feature or request) label on Jan 23, 2023
guillaumekln (Collaborator) commented Jan 23, 2023

Yes, it would be great to implement tensor parallelism for large models.

Right now we support data parallelism on the GPU. We refer to it simply as "parallel execution" in the documentation.
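
For reference, a minimal sketch of what this "parallel execution" (data parallelism) looks like from the Python API, assuming a converted model directory and placeholder input tokens; each GPU listed in `device_index` holds a full model replica, so this improves throughput but not the maximum model size:

```python
import ctranslate2

# Load one model replica per listed GPU; batches submitted to the translator
# are dispatched across the replicas (data parallelism).
translator = ctranslate2.Translator(
    "ende_ctranslate2/",      # placeholder path to a converted model
    device="cuda",
    device_index=[0, 1],      # two replicas, one per GPU
)

# Placeholder tokenized input. Each replica still needs the full weights,
# which is why tensor parallelism is needed for models that do not fit on one GPU.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```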

DHOFM commented Sep 14, 2023

I would very much appreciate it if tensor parallelism could be implemented. I tried Llama-2-13B on 2x RTX 3090 in fp16 and got OOM, while 8-bit works fine on a single GPU.

minhthuc2502 (Collaborator) commented Mar 1, 2024

I just pushed PR #1599 to support tensor parallelism. This helps split a model across multiple GPUs. I tested this feature with some models like Llama2 and translation models. I would appreciate it if you could test this feature with other models, or suggest which models are the most important to test.

I did some tests with Llama2:

| Machines | GPUs | GPU type | Batch size | Perf (tokens/sec) | GPU memory | Quantized | Model |
|---|---|---|---|---|---|---|---|
| 1 | 1 | Tesla V100-PCIE-16GB | 1 | 46.9 | 7352 MB | Yes | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 1 | 51.5 | 3848 MB | Yes | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 1 | 17.8 | 2280 MB | Yes | llama 7b |
| 1 | 1 | Tesla V100-PCIE-16GB | 5 | 185.3 | 7352 MB | Yes | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 5 | 176 | 3848 MB | Yes | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 5 | 62 | 2280 MB | Yes | llama 7b |
| 1 | 1 | Tesla V100-PCIE-16GB | 1 | 43.3 | 13880 MB | No | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 1 | 66.5 | 7240 MB | No | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 1 | 31.9 | 4136 MB | No | llama 7b |
| 1 | 1 | Tesla V100-PCIE-16GB | 5 | 179.3 | 13880 MB | No | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 5 | 249.5 | 7240 MB | No | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 5 | 101.7 | 4136 MB | No | llama 7b |

If the GPUs are on the same machine, inference performance is better. If the GPUs are spread across different machines, performance is lower due to network latency.
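
For anyone who wants to try it, a rough sketch of the intended usage, assuming the `tensor_parallel` constructor flag from PR #1599 and a build with tensor-parallel support launched through MPI (e.g. `mpirun -np 2 python script.py`); the model path and prompt tokens are placeholders:

```python
import ctranslate2

# With tensor parallelism, each MPI rank/GPU holds a shard of the weights,
# so per-GPU memory drops roughly with the number of GPUs (see the table above).
generator = ctranslate2.Generator(
    "llama-2-7b-ct2/",        # placeholder path to a converted Llama 2 model
    device="cuda",
    tensor_parallel=True,     # shard the model across the GPUs visible to the ranks
)

results = generator.generate_batch(
    [["<s>", "▁Hello"]],      # placeholder prompt tokens
    max_length=32,
    sampling_temperature=0.8,
)
print(results[0].sequences[0])
```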

vince62s (Member) commented Mar 1, 2024

Did you run 5 samples with batch_size = 1 (one sentence at a time), or did you run batch_size = 5 (5 sentences per batch)?

minhthuc2502 (Collaborator) commented

I updated the comment above for 2 cases: batch_size = 1 and batch_size = 5.
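
To make the two cases concrete, a small sketch of how they map to API calls (assumed usage with placeholder paths and prompts):

```python
import ctranslate2

generator = ctranslate2.Generator("llama-2-7b-ct2/", device="cuda")  # placeholder path
prompt = ["<s>", "▁Hello"]  # placeholder prompt tokens

# batch_size = 1: a single sequence per generate_batch call.
single = generator.generate_batch([prompt], max_length=32)

# batch_size = 5: five sequences decoded together in one call,
# which is what produces the higher tokens/sec in the table above.
batched = generator.generate_batch([prompt] * 5, max_length=32)
```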

minhthuc2502 (Collaborator) commented

I'll close this issue as the feature is now supported. If you have any problems, feel free to open a new issue.

minhthuc2502 unpinned this issue Mar 5, 2024