
Splitting LLM layers across multiple GPUs #1052

Closed
JOHW85 opened this issue Jan 22, 2023 · 6 comments
Labels: enhancement (New feature or request)

Comments

JOHW85 commented Jan 22, 2023

As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism to split a model's layers across multiple GPUs (or GPU+CPU) to meet the memory requirements for loading the model, as described here:
https://huggingface.co/docs/transformers/v4.15.0/parallelism

JOHW85 changed the title Splitting LLM layers across multiple GPUs → Splitting LLM layers across multiple GPUs (Model Parallelism) on Jan 22, 2023
guillaumekln added the enhancement (New feature or request) label on Jan 23, 2023
guillaumekln (Collaborator) commented Jan 23, 2023

Yes, it would be great to implement tensor parallelism for large models.

Right now we support data parallelism on the GPU. We refer to it simply as "parallel execution" in the documentation.
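
For reference, a minimal sketch of what this "parallel execution" (data parallelism) looks like from the Python API, assuming a converted model directory and placeholder input tokens; each GPU listed in `device_index` holds a full model replica, so this improves throughput but not the maximum model size:

```python
import ctranslate2

# Load one model replica per listed GPU; batches submitted to the translator
# are dispatched across the replicas (data parallelism).
translator = ctranslate2.Translator(
    "ende_ctranslate2/",      # placeholder path to a converted model
    device="cuda",
    device_index=[0, 1],      # two replicas, one per GPU
)

# Placeholder tokenized input. Each replica still needs the full weights,
# which is why tensor parallelism is needed for models that do not fit on one GPU.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```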

DHOFM commented Sep 14, 2023

I would very much appreciate it if tensor parallelism could be implemented. I tried Llama-2-13B on 2x RTX 3090 in fp16 and got OOM, while 8-bit works fine on a single GPU.

minhthuc2502 (Collaborator) commented Mar 1, 2024

I just pushed PR #1599 to support tensor parallelism. This helps split a model across multiple GPUs. I tested this feature with some models like Llama2 and translation models. I would appreciate it if you could test this feature with other models, or suggest which models are the most important to test.

I did some tests with Llama2:

| Machines | GPUs | GPU type | Batch size | Perf (tokens/sec) | GPU memory | Quantized | Model |
|---|---|---|---|---|---|---|---|
| 1 | 1 | Tesla V100-PCIE-16GB | 1 | 46.9 | 7352 MB | Yes | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 1 | 51.5 | 3848 MB | Yes | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 1 | 17.8 | 2280 MB | Yes | llama 7b |
| 1 | 1 | Tesla V100-PCIE-16GB | 5 | 185.3 | 7352 MB | Yes | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 5 | 176 | 3848 MB | Yes | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 5 | 62 | 2280 MB | Yes | llama 7b |
| 1 | 1 | Tesla V100-PCIE-16GB | 1 | 43.3 | 13880 MB | No | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 1 | 66.5 | 7240 MB | No | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 1 | 31.9 | 4136 MB | No | llama 7b |
| 1 | 1 | Tesla V100-PCIE-16GB | 5 | 179.3 | 13880 MB | No | llama 7b |
| 1 | 2 | Tesla V100-PCIE-16GB | 5 | 249.5 | 7240 MB | No | llama 7b |
| 2 | 4 | Tesla V100-PCIE-16GB | 5 | 101.7 | 4136 MB | No | llama 7b |

If the GPUs are on the same machine, inference performance is better. If the GPUs are spread across different machines, performance is lower due to network latency.
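
For anyone who wants to try it, a rough sketch of the intended usage, assuming the `tensor_parallel` constructor flag from PR #1599 and a build with tensor-parallel support launched through MPI (e.g. `mpirun -np 2 python script.py`); the model path and prompt tokens are placeholders:

```python
import ctranslate2

# With tensor parallelism, each MPI rank/GPU holds a shard of the weights,
# so per-GPU memory drops roughly with the number of GPUs (see the table above).
generator = ctranslate2.Generator(
    "llama-2-7b-ct2/",        # placeholder path to a converted Llama 2 model
    device="cuda",
    tensor_parallel=True,     # shard the model across the GPUs visible to the ranks
)

results = generator.generate_batch(
    [["<s>", "▁Hello"]],      # placeholder prompt tokens
    max_length=32,
    sampling_temperature=0.8,
)
print(results[0].sequences[0])
```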

vince62s (Member) commented Mar 1, 2024

Did you run 5 samples with batch_size = 1 (one sentence at a time), or did you run batch_size = 5 (5 sentences per batch)?

minhthuc2502 (Collaborator) commented

I updated the comment above for 2 cases: batch_size = 1 and batch_size = 5.
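
To make the two cases concrete, a small sketch of how they map to API calls (assumed usage with placeholder paths and prompts):

```python
import ctranslate2

generator = ctranslate2.Generator("llama-2-7b-ct2/", device="cuda")  # placeholder path
prompt = ["<s>", "▁Hello"]  # placeholder prompt tokens

# batch_size = 1: a single sequence per generate_batch call.
single = generator.generate_batch([prompt], max_length=32)

# batch_size = 5: five sequences decoded together in one call,
# which is what produces the higher tokens/sec in the table above.
batched = generator.generate_batch([prompt] * 5, max_length=32)
```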

minhthuc2502 (Collaborator) commented

I'll close this issue as the feature is now supported. If you have any problems, feel free to open a new issue.

minhthuc2502 unpinned this issue Mar 5, 2024