Accuracy could not be recovered with the default setting of pytorch/examples/imagenet/main.py #1

Open
bliu3650 opened this issue Aug 4, 2022 · 1 comment

bliu3650 commented Aug 4, 2022

After modifying the script to launch with mpirun, training on 8x RTX 3090 with the nccl backend recovers the expected accuracy, but after switching to the cgx backend the top-1 accuracy always stays below 1%. The model used for validation is resnet50.
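For readers who want to reproduce the mpirun launch, here is a minimal sketch of one way such a modification could look. It is not the actual change made to `main.py`. It assumes Open MPI, which exports the `OMPI_COMM_WORLD_*` variables to every process; other MPI implementations use different names. The helper name `dist_env_from_mpi` is hypothetical.

```python
import os


def dist_env_from_mpi(environ=os.environ):
    """Read the process layout that Open MPI's mpirun exports to each process.

    Assumption: the job is launched by Open MPI. The helper name is
    hypothetical and not part of pytorch/examples or torch_cgx.
    """
    try:
        return {
            "rank": int(environ["OMPI_COMM_WORLD_RANK"]),
            "world_size": int(environ["OMPI_COMM_WORLD_SIZE"]),
            "local_rank": int(environ["OMPI_COMM_WORLD_LOCAL_RANK"]),
        }
    except KeyError as exc:
        raise RuntimeError(f"not launched by Open MPI mpirun: missing {exc}") from exc


# Example with a fake environment, standing in for rank 3 of an 8-process job.
fake_env = {
    "OMPI_COMM_WORLD_RANK": "3",
    "OMPI_COMM_WORLD_SIZE": "8",
    "OMPI_COMM_WORLD_LOCAL_RANK": "3",
}
print(dist_env_from_mpi(fake_env))
```

The resulting `rank` and `world_size` would then be passed to `torch.distributed.init_process_group`, and `local_rank` would select the GPU for the process.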

Apart from the default hyperparameters (e.g. batch size, lr, wd), the quantization bits and bucket size are set to 4 and 1024, following the paper.
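To make the two settings concrete, the sketch below shows what "4 bits, bucket size 1024" means for a generic bucketed max-min quantizer: every bucket of 1024 values gets its own range and is mapped to at most 2^4 = 16 levels. This is an illustration only, assuming deterministic rounding. It is not the cgx kernel, which may differ (for example in its rounding scheme), and the function names are hypothetical.

```python
import random


def quantize_bucket(values, bits):
    """Uniform max-min quantization of one bucket, returned in dequantized form."""
    levels = (1 << bits) - 1          # 15 intervals -> 16 representable values at 4 bits
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant bucket: nothing to quantize
        return list(values)
    scale = (hi - lo) / levels
    # Snap each value to the nearest of the evenly spaced levels in [lo, hi].
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_bucketed(values, bits=4, bucket_size=1024):
    """Quantize a flat list bucket by bucket, each bucket with its own range."""
    out = []
    for start in range(0, len(values), bucket_size):
        out.extend(quantize_bucket(values[start:start + bucket_size], bits))
    return out


# Illustrative stand-in for a flattened gradient tensor.
random.seed(0)
grads = [random.gauss(0.0, 1.0) for _ in range(4096)]
restored = quantize_bucketed(grads, bits=4, bucket_size=1024)
print(max(abs(a - b) for a, b in zip(grads, restored)))
```

A smaller bucket size narrows each bucket's range and so lowers the rounding error, at the cost of storing more per-bucket metadata.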

Could you share more details on reproducing the resnet50 results on imagenet with the cgx backend? Thanks.

@ilmarkov
Member

@bliu3650
I am very sorry for the delay.

We used the Nvidia examples for the experiments and did not modify the hyperparameters when running with the cgx backend.

Please try the new version of the library, as the old one had issues that occasionally affected training accuracy.
