Splitting LLM layers across multiple GPUs #1052
Comments
Yes, it would be great to implement tensor parallelism for large models. Right now we support data parallelism on the GPU, which we refer to simply as "parallel execution" in the documentation.
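For reference, a minimal sketch of that data-parallel "parallel execution" mode, assuming a model already converted to the CTranslate2 format in a placeholder directory `ct2_model_dir`; passing several device indices loads one model replica per GPU, and batches are dispatched across the replicas:

```python
import ctranslate2

# Data parallelism ("parallel execution"): one full model replica per GPU,
# with batches dispatched across the replicas.
translator = ctranslate2.Translator(
    "ct2_model_dir",          # placeholder: a converted CTranslate2 model
    device="cuda",
    device_index=[0, 1],      # load a replica on GPU 0 and GPU 1
)

# translate_batch expects pre-tokenized input; max_batch_size lets the
# library split large inputs into sub-batches for the replicas.
results = translator.translate_batch(
    [["▁Hello", "▁world", "!"]],
    max_batch_size=32,
)
print(results[0].hypotheses[0])
```

Note that this only scales throughput: every GPU must still hold the full model, which is why it does not help with out-of-memory errors on large LLMs.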
I would very much appreciate it if tensor parallelism could be implemented. I tried Llama-2-13B on two RTX 3090s in fp16 and got an out-of-memory error, while 8-bit definitely works fine on a single GPU.
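For comparison, a sketch of the single-GPU 8-bit setup mentioned above, assuming a Llama-2-13B model already converted with `ct2-transformers-converter` into a placeholder directory `llama-2-13b-ct2`; `compute_type="int8_float16"` quantizes the weights to 8-bit while keeping activations in fp16, roughly halving the memory footprint of a pure fp16 load:

```python
import ctranslate2

# 8-bit weights on a single GPU: enough to fit a 13B model that OOMs in fp16.
generator = ctranslate2.Generator(
    "llama-2-13b-ct2",            # placeholder: converted Llama-2-13B model
    device="cuda",
    device_index=0,
    compute_type="int8_float16",  # int8 weights, float16 activations
)

# generate_batch takes pre-tokenized prompts (placeholder tokens here).
results = generator.generate_batch([["<s>", "▁Hello"]], max_length=32)
print(results[0].sequences[0])
```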
I just pushed PR #1599 to support tensor parallelism. This makes it possible to split a model across multiple GPUs. I tested the feature with some models such as Llama 2 and a translation model. I would appreciate it if you could test it with other models or suggest which models are the most important to test. I ran some tests with Llama 2:
If the GPUs are in the same machine, inference performance is better. If the GPUs are on different machines, performance is lower because of network latency.
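A hedged sketch of how the tensor-parallel mode from the PR above is meant to be used; the exact option name and launch command should be checked against the PR and documentation, and `llama-2-13b-ct2` is again a placeholder model directory. The script is launched once per GPU under MPI (e.g. `mpirun -np 2 python3 run_tp.py`):

```python
import ctranslate2

# Tensor parallelism: the weights are sharded across the MPI ranks / GPUs
# instead of replicated, so a model too large for one GPU can still be loaded.
generator = ctranslate2.Generator(
    "llama-2-13b-ct2",     # placeholder: converted Llama-2-13B model
    device="cuda",
    tensor_parallel=True,  # assumed option name from the tensor-parallel PR
)

results = generator.generate_batch([["<s>", "▁Hello"]], max_length=32)
print(results[0].sequences[0])
```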
Did you run 5 samples with batch_size = 1, or did you run a single batch of 5 sentences (batch_size = 5)?
I updated the comment above to cover both cases: batch_size = 1 and batch_size = 5.
I'll close this issue since the feature is now supported. If you have any problems, feel free to open a new issue.
As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism to split a model's layers across multiple GPUs, or across GPU and CPU, to meet the memory requirements for loading the model, as described here:
https://huggingface.co/docs/transformers/v4.15.0/parallelism