Why is the running time different from the original? #1
Hello,
Your code is quite easy to understand; however, do you know why its running time is different from the original author's?
Comments
The original authors report results for the V100 GPU. The results I report here are for A100 and P100 GPUs, because that is what I have access to. The V100 has 640 tensor cores while the A100 only has 432; the P100 does not have any tensor cores at all. When I run the author's original, unmodified code on the A100 GPU, it takes 1.33 times longer. My simplified implementation takes 34.86 seconds on the A100 GPU; dividing by 1.33 gives roughly 26 seconds, which matches the time the authors report on their V100 GPU. In short, the code's running time is different because I ran it on different hardware, but extrapolating to the authors' hardware gives almost the expected numbers.
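For reference, the extrapolation above is just the measured A100 time divided by the measured slowdown factor:

```python
a100_time = 34.86   # seconds, simplified implementation on the A100
slowdown = 1.33     # original code runs 1.33x slower on the A100 than on the V100
print(a100_time / slowdown)  # ~26.2 seconds, close to the authors' reported V100 time
```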
I'm running the code on a V100 GPU, and your code's running time is indeed different from the original author's. After some testing, I think the main reason is that your code does not include the cuDNN warm-up used in the author's original code. From my tests, adding the warm-up code seems to be enough to reproduce a similar training time. Also, I think there are a few other differences between yours and the author's; please correct me if I'm wrong:
Which running time do you get? For me, the first batch takes about one second longer than the others, but after that it is fairly stable. If you want to exclude PyTorch initialization time, you can simply call main() several times:

```python
if __name__ == "__main__":
    for _ in range(5):
        main()
```
In my opinion, preprocessing times should be included in benchmarks to incentivise library developers to optimize their code in that regard, but unfortunately that is currently not the case.
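For anyone wondering what the cuDNN warm-up mentioned above actually does: it pushes a few dummy batches through the model before the timed run, so that CUDA initialization and cuDNN kernel selection are not counted. A minimal sketch, assuming a model already on the GPU (the batch shape and step count are placeholders, not taken from either codebase):

```python
import torch

def warm_up(model, batch_shape=(512, 3, 32, 32), steps=3, device="cuda"):
    # A few throwaway forward/backward passes so CUDA setup and cuDNN
    # autotuning happen before the benchmark starts.
    model.train()
    dummy = torch.randn(batch_shape, device=device)
    for _ in range(steps):
        model(dummy).sum().backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
```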
My running time (on the V100) was around 28-29 seconds for 10 epochs after adding the PyTorch initialization code. Without the initialization, the first batch took around 3 seconds longer in my case.
Nevertheless, your code is much easier to understand and use than the original author's, so thank you for sharing!
I have not tested different versions, but I think that the version of cuDNN is even more important for performance (as long as PyTorch makes use of the new features). Unfortunately, I could not find any benchmarks which compare different cuDNN versions.
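For anyone who wants to compare versions on their own machine, PyTorch reports the CUDA and cuDNN versions it was built against, so it is easy to note them alongside a benchmark run:

```python
import torch

print(torch.__version__)                    # PyTorch release
print(torch.version.cuda)                   # CUDA toolkit used for the build
print(torch.backends.cudnn.version())       # cuDNN version, e.g. 8200 for 8.2.0
print(torch.backends.cudnn.is_available())  # whether cuDNN can be used at all
```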
I have not used AMP yet, but I would probably start by benchmarking just a few layers to see if it makes a difference and whether it is worth the effort. If you do try it, let me know the results.
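If someone does try AMP, a rough way to benchmark just a couple of layers could look like the sketch below. The layer sizes and iteration counts are arbitrary placeholders and this is not part of the repository:

```python
import time
import torch

def bench(use_amp, iters=100):
    # Two small conv layers on dummy CIFAR-sized inputs, timed with and
    # without automatic mixed precision.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 64, 3, padding=1),
        torch.nn.ReLU(),
        torch.nn.Conv2d(64, 64, 3, padding=1),
    ).cuda()
    x = torch.randn(512, 3, 32, 32, device="cuda")
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = model(x).float().mean()
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
    torch.cuda.synchronize()
    return time.perf_counter() - start

print("fp32:", bench(False), "amp:", bench(True))
```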
I am glad that the code is helpful to you :)
Thank you very much! This code helped me a lot. The results on my V100:
Glad that you found it useful! And thank you for the results on a V100 GPU. I'll add them to the README right away.
Just a quick update: the PyTorch version on our server was upgraded recently, and the results for the A100 GPU are now more than twice as fast, probably because the tensor cores are actually being utilized now. Full log for 100 runs: cifar10-fast-simple/logs/A100.txt (line 2006 at commit 242b699).
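If anyone wants to check whether tensor-core usage is the reason on their own setup: newer PyTorch releases can use TF32 tensor-core math for float32 work on Ampere GPUs, and the relevant switches can be inspected directly. This is a general PyTorch feature, not something specific to this repository, so treat it as a guess about the speed-up rather than a confirmed cause:

```python
import torch

# TF32 lets Ampere GPUs (such as the A100) run float32 matmuls and
# convolutions on the tensor cores at reduced mantissa precision.
print(torch.backends.cuda.matmul.allow_tf32)   # matmul TF32 flag
print(torch.backends.cudnn.allow_tf32)         # cuDNN convolution TF32 flag

# Both can be toggled explicitly if you want to compare timings.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```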