cifar10 example is not scalable with multiple GPUs #75
Thanks for reporting @H4dr1en! I'd like to reproduce your results to see what happens once I have some time for that. Today, we have some benchmarks for Pascal VOC on 1, 2 and 4 GPUs (GeForce RTX 2080 Ti).
Btw, thanks for pointing out the clearml-task feature, cool feature!
@H4dr1en Thanks for that report. It sounds weird; I had similar experiments when I worked in a research center with a GPU cluster, and scalability was fine there. Did you try disabling clearml? Transferring the results to the server can create disruptions and interruptions during training.
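For reference, a minimal sketch (not the example's actual code) of how one might gate ClearML tracking behind a flag to time runs with and without it; the flag name and project/task names are made up:

```python
# Hypothetical sketch: only initialize ClearML when explicitly requested,
# so training can be timed with and without experiment tracking.
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--with_clearml", action="store_true",
                        help="enable ClearML experiment tracking")
    args = parser.parse_args()

    task = None
    if args.with_clearml:
        from clearml import Task
        task = Task.init(project_name="cifar10-ddp", task_name="scaling-test")

    # ... run the training loop here; nothing is reported when task is None ...


if __name__ == "__main__":
    main()
```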
Thanks for your answers! From what you reported, scaling the training speed linearly with the number of GPUs should be achievable, so there could be something wrong that can be fixed. For context, I observe similarly bad scalability for my own use case on g4dn.12xlarge instances, so I hope that if we can find the bottleneck in the cifar10 example, it will also unblock my other project.
@H4dr1en can you please try the original cifar10 example on your infrastructure using 1, 2 and 4 GPUs and report the runtimes back here?

CUDA_VISIBLE_DEVICES=0 python main.py run

# for older pytorch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
# for pytorch >= 1.9
torchrun --nproc_per_node=2 main.py run --backend="nccl"

# for older pytorch
python -u -m torch.distributed.launch --nproc_per_node=4 --use_env main.py run --backend="nccl"
# for pytorch >= 1.9
torchrun --nproc_per_node=4 main.py run --backend="nccl"

My times on 1 and 2 GPUs (GTX 1080 Ti) to compare:
Thanks @vfdev-5! Here are my results:
It looks like there is something wrong with 1 GPU on g4dn.12xlarge. Btw, is it fair to compare the speed this way? I.e. in a multi-GPU context, each GPU gets a smaller batch_size.
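As a side note, a hedged sketch of comparing runs by throughput (images per second) rather than raw runtime, which stays meaningful even when the per-GPU batch size differs; the numbers are placeholders, not measured values:

```python
# Illustrative only: compare runs by images processed per second.
def throughput(num_images: int, runtime_s: float) -> float:
    """Images processed per second for one run."""
    return num_images / runtime_s


# Placeholder inputs: total images seen during training and wall-clock time.
runs = {
    "1 GPU": throughput(num_images=50_000 * 20, runtime_s=1200.0),
    "2 GPUs": throughput(num_images=50_000 * 20, runtime_s=700.0),
}
for name, ips in runs.items():
    print(f"{name}: {ips:.0f} images/s")
```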
Thanks for the results @H4dr1en! Definitely, there is something unclear with the 1 GPU case on g4dn.12xlarge.
Well, I'd say we are interested in how quickly the task gets done, and by "the task" we should measure the number of processed images. If in a multi-GPU context we load each GPU as in the single-GPU case, we have to reduce the number of iterations to run, otherwise the runs won't accomplish the same task, I think. EDIT: in the logs for
No, here are the logs for this run:
But the behaviour above for 1 GPU on g4dn.12xlarge is probably a separate issue. Sorry, I was not very explicit in the issue description; my main concern is the following: if we define the factor of improvement as the runtime on 1 GPU divided by the runtime on 2 GPUs, I would expect that factor to approach 2, and it never seems to be achieved. What could be the reason for that?
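A small sketch of that computation with placeholder runtimes (not the measured ones from this thread):

```python
# Illustrative speedup / scaling-efficiency calculation.
def improvement_factor(runtime_1gpu_s: float, runtime_ngpu_s: float) -> float:
    """Ratio of single-GPU runtime to N-GPU runtime; ideally equal to N."""
    return runtime_1gpu_s / runtime_ngpu_s


# Placeholder runtimes (seconds) for the same amount of work.
t1, t2, t4 = 1000.0, 620.0, 400.0
for n, tn in [(2, t2), (4, t4)]:
    factor = improvement_factor(t1, tn)
    print(f"{n} GPUs: factor={factor:.2f}, scaling efficiency={factor / n:.0%}")
```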
I think in the case of cifar10 a larger model could give something like linear scaling (up to a certain limit).
See also the results for Pascal VOC: #75 (comment), where the factor ~ N GPUs for N = 1, 2 and 4.
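For example, a rough sketch of swapping in a heavier torchvision backbone so that per-iteration compute dominates data loading and communication; the choice of ResNet-50 here is purely illustrative:

```python
# Illustrative: use a heavier model than the usual CIFAR10 toy network
# to check whether multi-GPU scaling improves when compute dominates.
import torch.nn as nn
from torchvision import models


def get_model(num_classes: int = 10) -> nn.Module:
    # ResNet-50 with its classification head resized for 10 classes.
    return models.resnet50(num_classes=num_classes)
```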
I slightly adapted the cifar10 example in this fork, basically removing python-fire and adding a torch.distributed.launch entry point, so that it can be executed as a standalone script with clearml-task.
I executed the following script with nproc_per_node in [1, 2, 3, 4] on an AWS g4dn.12xlarge instance (4x T4 GPUs) and got the following results. Here I disabled DataParallel, as mentioned in pytorch/ignite#2447 (DataParallel is used by auto_model with a single GPU).
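As a rough illustration only (not the actual fork), such a standalone entry point might look like the sketch below, assuming ignite's idist.Parallel helper and a hypothetical training(local_rank, config) function:

```python
# Hypothetical sketch of a standalone DDP entry point: argparse instead of
# python-fire, so the script can be submitted as-is (e.g. via clearml-task).
import argparse

import ignite.distributed as idist


def training(local_rank, config):
    # ... build dataloaders, model and optimizer with idist.auto_* helpers
    # and run the usual cifar10 training loop here ...
    pass


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", type=str, default="nccl")
    parser.add_argument("--batch_size", type=int, default=512)
    args = parser.parse_args()

    config = {"batch_size": args.batch_size}

    # idist.Parallel picks up the distributed context created by
    # torch.distributed.launch / torchrun (one process per GPU).
    with idist.Parallel(backend=args.backend) as parallel:
        parallel.run(training, config)


if __name__ == "__main__":
    main()
```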
I am increasing the batch size by 16 each time I add a GPU, so that each GPU gets the same number of samples. I didn't change the default number of processes (8) for any of the runs, because I didn't observe that the GPUs were under-used (below 95% utilization).
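A minimal sketch of that batch-size rule, assuming a fixed per-GPU batch size and a world size read from the distributed context (the constant below reflects the +16 per GPU described above and is otherwise illustrative):

```python
# Illustrative: keep the per-GPU batch size constant and grow the global
# batch size with the number of processes (one process per GPU).
import ignite.distributed as idist

PER_GPU_BATCH_SIZE = 16


def global_batch_size() -> int:
    # idist.get_world_size() returns 1 outside of a distributed context.
    return PER_GPU_BATCH_SIZE * idist.get_world_size()
```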
GPU utilization as reported by clearml
I was expecting to observe a quasi-linear time improvement, but that isn't the case. Am I missing something?
PS: Here are the requirements I used to execute the script: