
Cannot reproduce the accuracy of ResNet #5593

Closed
Ezra-Yu opened this issue Mar 11, 2022 · 3 comments

Comments


Ezra-Yu commented Mar 11, 2022

I have run the ResNet baseline from this blog. The reported accuracy of the baseline is 76.16, but I cannot reproduce it. Here are my results:
Acc@1 75.878 Acc@5 92.856
Acc@1 75.382 Acc@5 92.574
Acc@1 75.490 Acc@5 92.714

My environment is:
CUDA 11.3
PyTorch 1.10.2
torchvision 0.11.3

All the settings are default; I just ran:
torchrun --nproc_per_node=8 train.py --model resnet50

cc @datumbox

@datumbox (Contributor) commented:

@Ezra-Yu Thanks for reporting.

The baseline accuracy reported in the blog post is for TorchVision's previously released pre-trained model. I verified its accuracy with:

torchrun --nproc_per_node=1 train.py --test-only --prototype --weights ResNet50_Weights.IMAGENET1K_V1 --model resnet50 -b 1
Acc@1 76.132 Acc@5 92.864

Note that minor differences in the 2nd decimal are expected (see #4559), but the accuracy of the released pre-trained model matches.
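
For reference, a similar check can be done directly in Python. This is a minimal sketch assuming a later torchvision release where the multi-weight API is stable (in 0.11.x it was still under the prototype area, hence the --prototype flag above); it loads the same IMAGENET1K_V1 checkpoint and prints the metadata recorded with the weights rather than re-running the full ImageNet evaluation:

# Minimal sketch (assumes torchvision >= 0.13, where the multi-weight API is stable).
# Loads the same pre-trained checkpoint used above and prints the metadata
# recorded with the weights; it does not re-run ImageNet validation.
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights)
model.eval()

print(weights.meta)  # includes the reference accuracies for this checkpoint

# The matching preprocessing (resize / crop / normalization) ships with the weights:
preprocess = weights.transforms()
with torch.inference_mode():
    out = model(preprocess(torch.zeros(3, 224, 224)).unsqueeze(0))
print(out.shape)  # torch.Size([1, 1000])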

According to the references, you are using the correct command / hyper-parameters for training the model, so the listed command should give you a similar model. The difference you report is significant, but so is the variance across your own runs: this model is trained for very few epochs and hasn't fully converged. To investigate further, I tried to locate the training log of the released model, but it was trained many years ago and the person who trained it is no longer with our team, so I wasn't able to find it. I think the most likely scenario is that the person who trained it got lucky with a good initialization. This can be confirmed by doing a few more runs.
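
As a rough sanity check on that variance argument, here is a small, hypothetical back-of-the-envelope calculation over the three Acc@1 numbers reported above (not part of the original thread); the run-to-run spread already covers a large portion of the gap to 76.132:

# Quick check of the run-to-run variance, using only the three Acc@1
# numbers reported in this issue (hypothetical helper snippet).
from statistics import mean, stdev

runs = [75.878, 75.382, 75.490]   # reported Acc@1 of the three training runs
released = 76.132                 # Acc@1 of the released pre-trained model

mu, sigma = mean(runs), stdev(runs)
print(f"mean={mu:.3f}  stdev={sigma:.3f}  gap={released - mu:.3f}")
# mean=75.583  stdev=0.261  gap=0.549 -> the gap is roughly two sample standard
# deviations, which with only three runs is hard to distinguish from a lucky init.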

Concerning the other accuracies reported in the blog post: those models were trained in 2021 using the scripts in our repo, and I can confirm these are the numbers we got on our runs. You should be able to fully reproduce them with the specific scripts.

I'm going to close the issue to keep things tidy but if you have concerns please feel free to reopen. Thanks!


Ezra-Yu commented Mar 11, 2022

Thank you for your quick reply.

So, do you mean that, following the LR optimization step in the blog, I can fully reproduce the reported result?


datumbox commented Mar 11, 2022

If you run this command, you should be able to fully reproduce the updated result:

torchrun --nproc_per_node=8 train.py --model resnet50 --batch-size 128 --lr 0.5 \
--lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear \
--auto-augment ta_wide --epochs 600 --random-erase 0.1 --weight-decay 0.00002 \
--norm-weight-decay 0.0 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 \
--train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4
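
To make the LR-related flags a bit more concrete, here is a hedged sketch of how a linear warm-up followed by cosine annealing can be composed in plain PyTorch. This mirrors what --lr-warmup-method linear --lr-warmup-epochs 5 --lr-scheduler cosineannealinglr select in the reference script, but it is an approximation, not the script's exact code:

# Sketch of the LR schedule selected by the flags above: 5 epochs of linear
# warm-up followed by cosine annealing over the remaining epochs.
# Approximation of the reference script's behaviour, not its exact code.
import torch

epochs, warmup_epochs, base_lr = 600, 5, 0.5

model = torch.nn.Linear(10, 10)                      # stand-in for resnet50
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(epochs):
    # ... one training epoch with the recipe's augmentations would run here ...
    scheduler.step()                                  # stepped once per epoch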

People from the community have already reproduced it successfully, and actually improved upon it. See #5201.
