Accuracy could not be recovered with the default setting of pytorch/examples/imagenet/main.py #1

bliu3650 · 2022-08-04T01:30:49Z

After modifying the script for mpirun launch, training on 8x RTX 3090 with the nccl backend recovers the accuracy, but after switching to the cgx backend the top-1 accuracy always stays under 1%. The model used for validation is resnet50.
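For readers who want to reproduce the mpirun launch, here is a minimal sketch of one way to map the launcher's environment onto the values torch.distributed expects. It assumes Open MPI, which exports the `OMPI_COMM_WORLD_*` variables; the helper name `mpi_dist_env` is hypothetical and this is not necessarily the modification used in this report.

```python
import os


def mpi_dist_env(environ=None):
    """Read rank information exported by Open MPI's mpirun.

    Hypothetical helper: falls back to a single-process setup
    when the OMPI_* variables are not present.
    """
    env = os.environ if environ is None else environ
    return {
        # Global rank of this process across all nodes.
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        # Total number of processes started by mpirun.
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        # Rank within the node, typically used to pick the GPU.
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
    }
```

The resulting `rank` and `world_size` could then be passed to `torch.distributed.init_process_group(backend=..., rank=..., world_size=...)`, with `MASTER_ADDR` and `MASTER_PORT` set in the environment, and `local_rank` used to select the CUDA device.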

Besides the default hyper-parameter settings (e.g. batch, lr, wd), the quantization bits and bucket size are set to 4 and 1024, following the paper.
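To make the two settings concrete, here is a rough sketch of what "4 bits with bucket size 1024" means for a generic uniform min-max quantizer: each bucket of 1024 values is mapped onto 16 levels spanning that bucket's own range. This is an illustrative assumption written in plain Python, not the cgx implementation, and the function name `quantize_bucketed` is hypothetical.

```python
def quantize_bucketed(values, bits=4, bucket_size=1024):
    """Quantize and immediately dequantize `values`, one bucket at a time.

    Illustrative only: a deterministic uniform min-max quantizer,
    not the kernel used by any particular library.
    """
    levels = (1 << bits) - 1  # 15 intervals, i.e. 16 codes for 4 bits
    out = []
    for start in range(0, len(values), bucket_size):
        bucket = values[start:start + bucket_size]
        lo, hi = min(bucket), max(bucket)
        scale = (hi - lo) / levels
        if scale == 0.0:
            # All values in the bucket are equal, nothing to quantize.
            out.extend(bucket)
            continue
        for v in bucket:
            q = round((v - lo) / scale)  # integer code in [0, levels]
            out.append(lo + q * scale)   # value seen after dequantization
    return out
```

With this scheme the per-element error is bounded by roughly half of `(max - min) / 15` within each bucket, so a smaller bucket size or more bits tightens the approximation at the cost of less compression.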

Could you share more details about reproducing resnet50 on imagenet with the cgx backend? Thanks.

ilmarkov · 2022-11-22T17:14:36Z

@bliu3650
I am very sorry for the delay.

We used the Nvidia examples for the experiments. We did not modify the hyperparameters when running with the cgx backend.

Please try the new version of the library, as the old one had issues that occasionally affected the training accuracy.
