
Why is the running time different from the original? #1

Open
minh-h-ng opened this issue May 11, 2021 · 8 comments

@minh-h-ng

Hello,

Your code is quite easy to understand; however, do you know why its running time differs from the original author's?

@99991 (Owner) commented May 11, 2021

The original authors report results for the V100 GPU. The results I report here are for A100 and P100 GPUs, because that is what I have access to.

The V100 has 640 tensor cores, while the A100 only has 432; the P100 does not have any tensor cores at all: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

Since the code makes use of FP16, tensor cores are very important for performance.

Side note: an A100's tensor cores should, in theory, be much faster than a V100's, but that does not seem to hold here in practice. I have not investigated why. One possible explanation is that the PyTorch/CUDA drivers have not been optimized for this workload yet.

When I run the authors' original, unmodified code on the A100 GPU, it takes 1.33 times longer than they report. My simplified implementation takes 34.86 seconds on the A100; dividing by 1.33 gives roughly 26 seconds, which matches the time the authors report on their V100 GPU.

In short, the running time differs because I ran the code on different hardware, but extrapolating to the authors' hardware gives almost exactly the expected numbers.

@minh-h-ng (Author)

I'm running the code on a V100 GPU, and its running time is indeed different from the original author's. After some testing, I think the main reason is that your code does not include the cuDNN warm-up from the author's original code. From my tests, adding the warm-up code is enough to produce a similar training time.
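Something like the following is what I mean, as a minimal sketch rather than the author's exact code; the model here is a hypothetical FP16 network already on the GPU:

import torch
import torch.nn.functional as F

def warm_up(model, steps=3, batch_size=512):
    # A few dummy forward/backward passes let cuDNN pick its convolution
    # algorithms before the timed run, so the first real batch is not slower.
    inputs = torch.randn(batch_size, 3, 32, 32, dtype=torch.float16, device="cuda")
    targets = torch.randint(0, 10, (batch_size,), device="cuda")
    for _ in range(steps):
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
    model.zero_grad()
    torch.cuda.synchronize()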

Also, I think there are a few other differences between your code and the author's; please correct me if I'm wrong:

  • The author's final model (after TTA) used lr_schedule([0, epochs/5, epochs - ema_epochs], [0.0, 1.0, 0.1], batch_size) with ema_epochs=2, so I think this part of your code, torch.linspace(2e-3, 2e-4, 582), should use 776 instead of 582.
  • The preprocessing time in your code is still quite large (more than 5 seconds); do you know the reason behind this?

@99991 (Owner) commented May 11, 2021

Which running time do you get?

For me, the first batch takes one second longer than the others, but after that it is fairly stable.

If you want to exclude PyTorch initialization time, simply call main() multiple times and ignore the results for the first run:

if __name__ == "__main__":
    for _ in range(5):
        main()
  • I think the learning rate should ramp up for the first 194 batches, then ramp down for 582 batches (194 + 582 = 776), and then stay there (see the sketch after this list). The exact values are not that important, as long as the accuracy is still good in the end. You might get even better accuracy by scaling the learning rate individually for every weight tensor.
  • The preprocessing time is dominated by two things:
    1. Loading the CIFAR-10 dataset with torchvision. This currently takes over a second because torchvision unpickles the data and computes MD5 hashes for integrity every time. The integrity check is necessary because loading pickle files can execute arbitrary code, so pickle is clearly not a good file format for this use case. It would only take a few milliseconds to load the data if it were stored as a binary blob and simply loaded with np.fromfile(..., dtype=np.uint8) (see the sketch at the end of this comment). But since DAWNBench does not count preprocessing time, I kept the shorter torchvision code.
    2. The first PyTorch call that touches the GPU takes forever. That time could be greatly reduced by only loading the functionality that is actually used in the end, but that would have to be done by the PyTorch and CUDA developers.
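Regarding the first point, here is a minimal sketch of that schedule; the values just follow the numbers above, and the actual code may organize this differently:

import torch

# Ramp up for 194 batches, decay for 582 (194 + 582 = 776), then hold.
lr_schedule = torch.cat([
    torch.linspace(0.0, 2e-3, 194),   # warm-up
    torch.linspace(2e-3, 2e-4, 582),  # decay
])

def learning_rate(batch_index):
    # After batch 776 the learning rate stays at its final value.
    return lr_schedule[min(batch_index, len(lr_schedule) - 1)].item()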

In my opinion, preprocessing times should be included in benchmarks to incentivise library developers to optimize their code in that regard, but unfortunately that is currently not the case.
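For illustration, a minimal sketch of the binary-blob approach mentioned above; the file names are hypothetical, and the arrays would have to be dumped once beforehand:

import numpy as np

# One-time conversion, e.g. after unpickling with torchvision:
#     images.astype(np.uint8).tofile("train_images.bin")
#     labels.astype(np.uint8).tofile("train_labels.bin")

# From then on, loading is a single read: no unpickling, no MD5 hashing.
images = np.fromfile("train_images.bin", dtype=np.uint8).reshape(-1, 32, 32, 3)
labels = np.fromfile("train_labels.bin", dtype=np.uint8)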

@minh-h-ng (Author) commented May 12, 2021

My running time (on a V100) was around 28-29 seconds for 10 epochs after adding the PyTorch initialization code. Without the initialization, the first batch took around 3 seconds longer in my case.

  • You are right about the learning rate: it is changed (for the second time) at batch 776, but it was already changed (for the first time) at batch 194, so torch.linspace(2e-3, 2e-4, 582) is correct.
  • This is interesting, I did not know that torchvision loads pickle files for the CIFAR-10 dataset. And I agree with you about including the preprocessing time; I guess the DAWNBench organizers simply expected the loading and preprocessing time of CIFAR-10 to be much shorter than the training time, since the dataset is small.
  • Did you experience a training time difference between different PyTorch versions? In my case, training took 10-15% longer after moving from PyTorch 1.2 to 1.6. From what I remember reading in some online posts, even earlier PyTorch versions may run faster than 1.2.
  • Also, do you have any suggestions on how to add PyTorch AMP (Automatic Mixed Precision) to your code? I'm looking to test AMP with the code, but I'm not sure yet how to implement it.

Nevertheless, your code is much easier to understand and use than the original author's, so thank you for sharing!

@99991 (Owner) commented May 13, 2021

Did you experience a training time difference between different PyTorch versions?

I have not tested different versions, but I think that the version of cuDNN is even more important for performance (as long as PyTorch makes use of the new features). Unfortunately, I could not find any benchmarks which compare different cuDNN versions.

Also, do you have any suggestions on how to add PyTorch AMP (Automatic Mixed Precision) to your code?

I have not used AMP yet, but I would probably start by benchmarking just a few layers to see whether it makes a difference and is worth the effort. If you try it, let me know the results.
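That said, a typical torch.cuda.amp training loop looks roughly like this (a minimal sketch; model, optimizer, loss_fn, and train_loader are hypothetical stand-ins for whatever your script defines):

import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()

    # Run the forward pass and loss computation in mixed precision.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Scale the loss to avoid FP16 gradient underflow, then unscale and step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()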

Nevertheless, your code is much easier to understand and use than the original author's, so thank you for sharing!

I am glad that the code is helpful to you :)

@ZipengFeng

Thank you very much! This code helps me a lot.

The results on my V100:

Preprocessing: 4.78 seconds

epoch    batch    train time [sec]    validation accuracy
    1       97                4.24                 0.2051
    2      194                7.09                 0.7661
    3      291                9.93                 0.8749
    4      388               12.78                 0.8982
    5      485               15.62                 0.9139
    6      582               18.48                 0.9237
    7      679               21.33                 0.9301
    8      776               24.18                 0.9348
    9      873               27.04                 0.9396
   10      970               29.90                 0.9422

@99991 (Owner) commented Mar 3, 2022

Glad that you found it useful! And thank you for the results on a V100 GPU. I'll add them to the README right away.

@99991 (Owner) commented Mar 26, 2022

Just a quick update: the PyTorch version on our server was upgraded recently, and the A100 GPU is now over twice as fast, probably because the tensor cores are actually being utilized now. Full log for 100 runs:
