Problem launching training with less GPUs #92
Oh, this is because we save the seed and random-state information in the checkpoint. So when you resume from a checkpoint saved on a 7-GPU node and run the training on 6 GPUs, it triggers the bug you hit. One solution is to use |
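To make the failure mode described above concrete, here is a minimal sketch in plain Python. It is not Sana's actual checkpoint code: the checkpoint layout, the `rng_states` key, and the helper names `save_checkpoint` and `restore_rng` are all hypothetical, chosen only to show why per-rank random state saved with 7 processes does not map cleanly onto 6, and how a loader might fall back to reseeding instead of failing.

```python
import random


def save_checkpoint(step, world_size, base_seed):
    """Build a checkpoint dict with one RNG state per rank (hypothetical layout)."""
    rng_states = []
    for rank in range(world_size):
        rng = random.Random(base_seed + rank)
        rng.random()  # advance the generator, as training would
        rng_states.append(rng.getstate())
    return {"step": step, "rng_states": rng_states}


def restore_rng(checkpoint, rank, world_size, fallback_seed):
    """Return (rng, restored). Restore only if the saved world size matches."""
    states = checkpoint["rng_states"]
    rng = random.Random()
    if len(states) == world_size:
        rng.setstate(states[rank])
        return rng, True
    # World size changed: the saved states no longer correspond one-to-one
    # to the current ranks, so reseed rather than load a mismatched state.
    rng.seed(fallback_seed + rank)
    return rng, False


ckpt = save_checkpoint(step=1000, world_size=7, base_seed=1)
_, restored_same = restore_rng(ckpt, rank=0, world_size=7, fallback_seed=1)
_, restored_fewer = restore_rng(ckpt, rank=0, world_size=6, fallback_seed=1)
print(restored_same, restored_fewer)  # True False
```

In this sketch the model weights and step counter would still be usable after the GPU count changes; only the exact random sequence is lost, which is usually an acceptable trade-off when resuming on different hardware.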
Hi. Unfortunately, it doesn't work. Same error:
I tried using |
Oh, it should be --model.load_from xxx.pth |
Hi.
And using
|
Hi. I'm still stuck with this problem and have no clue... |
Hi @lawrence-cj . |
What is the output of |
Thank you @lawrence-cj . This checkpoint is currently at 222k steps trained on 2 GPUs.
And here is the output:
So, the config of
And config.resume_from is also set to latest: How can I force it? |
Hi @lawrence-cj . |
Hi.
I set up a Sana training session with one 4090 GPU on a PC, and everything was fine, so I moved the config and the checkpoint to a PC with 7 x 4090. Everything was OK on multi-GPU.
Later, I restarted the training session with only 6 GPUs and got this error:
(Note that it restarts fine with 7 GPUs, but not with fewer than that.)