
Cannot load model on multi gpus #1396

Closed
weiqisun opened this issue Aug 2, 2023 · 4 comments

Comments

@weiqisun

weiqisun commented Aug 2, 2023

Hi, I'm trying to run inference with the llama-2-70b-chat model on two A100-80GB GPUs. I converted the model to the CTranslate2 format with dtype bfloat16 by running:

ct2-transformers-converter --model meta-llama/Llama-2-70b-chat-hf --quantization bfloat16 --output_dir ./llama-2-70b-chat-bf16-ct2

I understand that I should be able to fit the model on a single GPU if I run it with 8-bit quantization. But I need to do some apples-to-apples comparisons with bfloat16.
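(For reference, a single-GPU 8-bit conversion would look roughly like the command below; int8_float16 is one of the quantization types the converter accepts, and the output directory name is just a placeholder.)

ct2-transformers-converter --model meta-llama/Llama-2-70b-chat-hf --quantization int8_float16 --output_dir ./llama-2-70b-chat-int8-ct2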

Then in Python I tried to load the model on two GPUs. I have eight GPUs and was trying to load the model on the last two, but I got a CUDA out of memory error:

>>> import ctranslate2 as ct2
>>> generator = ct2.Generator("llama-2-70b-chat-bf16-ct2", device="cuda", device_index=[6,7], compute_type="bfloat16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA failed with error out of memory

However, watching nvidia-smi shows that CTranslate2 only tried to put the model on the first device given in device_index, which is device 6 in this case. There is zero usage of device 7:

Wed Aug  2 16:42:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0    65W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   33C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   36C    P0    66W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:80:00.0 Off |                    0 |
| N/A   34C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:80:01.0 Off |                    0 |
| N/A   36C    P0    70W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:80:02.0 Off |                    0 |
| N/A   35C    P0    69W / 400W |  80807MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:80:03.0 Off |                    0 |
| N/A   36C    P0    66W / 400W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    6   N/A  N/A    693867      C   python                          80804MiB |
+-----------------------------------------------------------------------------+

I also tried using all eight GPUs with device_index=[0,1,2,3,4,5,6,7] and setting device="auto", but got the same behavior.

Did I miss anything needed to run inference on multiple GPUs?

Thanks!

@guillaumekln
Collaborator

guillaumekln commented Aug 3, 2023

Hi,

CTranslate2 does not support model or tensor parallelism on multiple GPUs, if that's what you are trying to do. See #1052.

Setting multiple indices in device_index enables data parallelism, where the full model is loaded on each GPU and each GPU can process a different batch in parallel.
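(As an illustration of that data-parallel usage, here is a rough sketch; the model path, prompts, and tokens are placeholders, and it assumes a model small enough to fit on a single GPU.)

import ctranslate2

# Sketch of data parallelism (placeholders, not from this thread):
# one full replica of the model is loaded on each listed device, and
# concurrently submitted batches can be processed by different replicas.
generator = ctranslate2.Generator(
    "llama-2-7b-chat-ct2",   # placeholder path to a converted model
    device="cuda",
    device_index=[6, 7],     # a full copy is loaded on each device
)

# Placeholder token batches; real usage would tokenize the prompts
# with the model's own tokenizer.
batches = [
    [["<s>", "Hello"]],
    [["<s>", "How", "are", "you"]],
]

# Submitting the batches asynchronously lets both replicas run in parallel.
async_results = [
    generator.generate_batch(batch, max_length=32, asynchronous=True)
    for batch in batches
]
results = [r.result() for batch_results in async_results for r in batch_results]
for result in results:
    print(result.sequences[0])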

@weiqisun
Author

weiqisun commented Aug 3, 2023

Hi @guillaumekln, thanks for the clarification! I was not aware that it doesn't support model parallelism. Looking forward to the feature being implemented!

@guillaumekln
Collaborator

Closing this issue in favor of the other one.

@aongwachi1

Do we have any tips to speed up inference when deploying the "llama-2-70b-chat" model? Right now it's around 8 tokens per second on my side.
@guillaumekln @weiqisun
