Cannot load model on multi gpus #1396
Comments
Hi, CTranslate2 does not support model or tensor parallelism on multiple GPUs, if that's what you are trying to do. See #1052. Setting multiple indexes in `device_index` loads a separate copy of the model on each listed GPU, which is only useful for processing several batches in parallel.
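For readers landing here, a minimal sketch of what multiple `device_index` values actually do, assuming the replica semantics described above: one full model copy per listed GPU (data parallelism), so each GPU must hold the whole model on its own. The model path and tokens below are placeholders, not from this issue.

```python
import ctranslate2

# Each index in device_index gets its own full copy of the model; CTranslate2
# does not shard the weights across the listed GPUs. Concurrent calls are
# dispatched across the replicas, which helps throughput, not memory.
generator = ctranslate2.Generator(
    "llama-2-70b-chat-ct2",   # placeholder path to a converted model
    device="cuda",
    device_index=[0, 1],      # two replicas, one per GPU
)

# A single call still runs on one replica; the parallelism shows up when
# several batches are submitted concurrently (e.g. from multiple threads).
results = generator.generate_batch(
    [["<s>", "Hello", ",", "world"]],  # illustrative tokens from the model's tokenizer
    max_length=32,
)
print(results[0].sequences[0])
```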
Hi @guillaumekln, thanks for the clarification! I was not aware that it doesn't support model parallelism. Looking forward to the feature being implemented!
Closing this issue in favor of the other one.
Do we have any tips to speed up inference when deploying the `llama-2-70b-chat` model? Right now it's around 8 tokens per second on my side.
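No answer to this follow-up appears in the thread. As a hedged sketch only, one commonly used CTranslate2 lever for a 70B model is to load the weights with an `int8_float16` compute type, which quantizes them on load and usually improves tokens per second and memory use; the path below is a placeholder, not advice from the maintainers in this thread.

```python
import ctranslate2

# Sketch only: reuse an existing bfloat16 conversion but quantize the weights
# to int8 at load time (activations stay in FP16). This trades a little
# accuracy for a smaller footprint and faster decoding on most GPUs.
generator = ctranslate2.Generator(
    "llama-2-70b-chat-ct2",        # placeholder path to the converted model
    device="cuda",
    device_index=[0],
    compute_type="int8_float16",
)
```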
Hi, I'm trying to run inference with the `llama-2-70b-chat` model on two A100-80GB GPUs. I converted the model to the ct2 format with dtype `bfloat16`.

I understand that I should be able to fit the model on a single GPU if I ran it with 8-bit quantization, but I need to do some apples-to-apples comparisons with `bfloat16`.

Then in Python I tried to load the model on 2 GPUs. I have 8 GPUs and was trying to load the model on the last two, but I got a CUDA out-of-memory error. However, from watching `nvidia-smi`, CTranslate2 only tried to put the model on the first device given in `device_index`, which is device 6 in this case. There is zero usage of device 7.

I tried to use eight GPUs with `device_index=[0,1,2,3,4,5,6,7]` and also tried to set `device="auto"`, but I got the same behavior.

Did I miss anything in order to run inference on multiple GPUs?

Thanks!
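For reference, a rough sketch of the loading call this report describes (the model path is an assumption). With the replica semantics noted in the comments above, the full bfloat16 weights (roughly 140 GB for 70B parameters) would have to fit on device 6 alone, which matches the out-of-memory error and the idle device 7.

```python
import ctranslate2

# What the report above attempts: spreading one bfloat16 llama-2-70b-chat
# model over GPUs 6 and 7. CTranslate2 instead tries to load a full replica
# on device 6 first, which cannot hold ~140 GB of weights, hence the OOM.
generator = ctranslate2.Generator(
    "llama-2-70b-chat-ct2",   # placeholder path to the bfloat16 conversion
    device="cuda",
    device_index=[6, 7],
)
```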