
Cannot load model on multi gpus #1396

Closed
weiqisun opened this issue Aug 2, 2023 · 4 comments

Comments

@weiqisun

weiqisun commented Aug 2, 2023

Hi, I'm trying to run inference with the llama-2-70b-chat model on two A100-80GB GPUs. I converted the model to the CTranslate2 format with dtype bfloat16 by running:

ct2-transformers-converter --model meta-llama/Llama-2-70b-chat-hf --quantization bfloat16 --output_dir ./llama-2-70b-chat-bf16-ct2

I understand that I should be able to fit the model on a single GPU if I run it with 8-bit quantization. But I need to do some apples-to-apples comparisons with bfloat16.
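(For reference, a single-GPU 8-bit conversion would look roughly like the command below; int8_float16 is one of the quantization types the converter accepts, and the output directory name is just a placeholder.)

ct2-transformers-converter --model meta-llama/Llama-2-70b-chat-hf --quantization int8_float16 --output_dir ./llama-2-70b-chat-int8-ct2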

Then in Python I tried to load the model on two GPUs. I have eight GPUs and was trying to load the model on the last two, but I got a CUDA out of memory error:

>>> import ctranslate2 as ct2
>>> generator = ct2.Generator("llama-2-70b-chat-bf16-ct2", device="cuda", device_index=[6,7], compute_type="bfloat16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA failed with error out of memory

However, watching nvidia-smi shows that CTranslate2 only tried to put the model on the first device given in device_index, which is device 6 in this case. There is zero usage of device 7:

Wed Aug  2 16:42:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0    65W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   33C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   36C    P0    66W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:80:00.0 Off |                    0 |
| N/A   34C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:80:01.0 Off |                    0 |
| N/A   36C    P0    70W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:80:02.0 Off |                    0 |
| N/A   35C    P0    69W / 400W |  80807MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:80:03.0 Off |                    0 |
| N/A   36C    P0    66W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    6   N/A  N/A    693867      C   python                          80804MiB |
+-----------------------------------------------------------------------------+

I also tried using all eight GPUs with device_index=[0,1,2,3,4,5,6,7] and setting device="auto", but got the same behavior.

Did I miss anything needed to run inference on multiple GPUs?

Thanks!

@guillaumekln
Collaborator

guillaumekln commented Aug 3, 2023

Hi,

CTranslate2 does not support model or tensor parallelism on multiple GPUs, if that's what you are trying to do. See #1052.

Setting multiple indices in device_index enables data parallelism, where the full model is loaded on each GPU and each GPU can process a different batch in parallel.
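(As an illustration of that data-parallel usage, here is a rough sketch; the model path, prompts, and tokens are placeholders, and it assumes a model small enough to fit on a single GPU.)

import ctranslate2

# Sketch of data parallelism (placeholders, not from this thread):
# one full replica of the model is loaded on each listed device, and
# concurrently submitted batches can be processed by different replicas.
generator = ctranslate2.Generator(
    "llama-2-7b-chat-ct2",   # placeholder path to a converted model
    device="cuda",
    device_index=[6, 7],     # a full copy is loaded on each device
)

# Placeholder token batches; real usage would tokenize the prompts
# with the model's own tokenizer.
batches = [
    [["<s>", "Hello"]],
    [["<s>", "How", "are", "you"]],
]

# Submitting the batches asynchronously lets both replicas run in parallel.
async_results = [
    generator.generate_batch(batch, max_length=32, asynchronous=True)
    for batch in batches
]
results = [r.result() for batch_results in async_results for r in batch_results]
for result in results:
    print(result.sequences[0])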

@weiqisun
Author

weiqisun commented Aug 3, 2023

Hi @guillaumekln, thanks for the clarification! I was not aware that it doesn't support model parallelism. Looking forward to the feature being implemented!

@guillaumekln
Collaborator

Closing this issue in favor of the other one.

@aongwachi1

Do we have any tips to speed up inference when deploying the "llama-2-70b-chat" model? Right now it's around 8 tokens per second on my side.
@guillaumekln @weiqisun
