
The results in the paper could not be reproduced #6

Open
eafn opened this issue Jun 23, 2023 · 13 comments


eafn commented Jun 23, 2023

Hi! I trained the model using your code, but I could not reproduce the results reported in the paper. Could you please release more training and test details?


GANPerf commented Jun 26, 2023

> Hi! I trained the model using your code, but I could not reproduce the results reported in the paper. Could you please release more training and test details?

Hi, I appreciate your interest. Could you kindly provide a detailed explanation regarding which specific results cannot be reproduced and on which dataset? This information will greatly assist in addressing the issue effectively. Thank you.

Repository owner deleted a comment from andayangyang Jun 26, 2023

eafn commented Jun 26, 2023

> Hi! I trained the model using your code, but I could not reproduce the results reported in the paper. Could you please release more training and test details?

> Hi, I appreciate your interest. Could you kindly provide a detailed explanation regarding which specific results cannot be reproduced and on which dataset? This information will greatly assist in addressing the issue effectively. Thank you.

Specifically, when I tried to reproduce the baseline results (an IN1K-pretrained ResNet-50 only, without any SSL pretraining) using main_lincls.py, I found that they differ significantly from those reported in the paper. With the given parameter settings (lr=30, bs=256, wd=0, schedule=[60,80]), the linear evaluation accuracy only reached 63.85, while the kNN result (with topk=200, t=0.1) was 46.37. This is a substantial discrepancy from the results reported in the paper. Despite trying various learning rates and batch sizes (e.g., lr=0.1, bs=256), the highest linear evaluation result I have achieved so far is only 65.68.
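
For reference, this is roughly the kNN evaluation I ran. It is a minimal sketch, not the repository's script; the function name and the assumption that features come from the frozen backbone are mine:

```python
import torch
import torch.nn.functional as F

def knn_eval(train_feats, train_labels, test_feats, test_labels,
             num_classes, topk=200, t=0.1):
    """Weighted kNN classification on L2-normalized features
    (cosine similarity with a temperature), the usual SSL-style protocol.
    train_labels / test_labels are LongTensors of class ids."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    # cosine similarity between every test sample and every train sample
    sim = test_feats @ train_feats.t()            # [N_test, N_train]
    topk_sim, topk_idx = sim.topk(topk, dim=1)    # [N_test, topk]
    topk_labels = train_labels[topk_idx]          # [N_test, topk]

    # temperature-scaled similarities act as vote weights
    weights = (topk_sim / t).exp()
    votes = torch.zeros(test_feats.size(0), num_classes, device=weights.device)
    votes.scatter_add_(1, topk_labels, weights)

    preds = votes.argmax(dim=1)
    return 100.0 * (preds == test_labels).float().mean().item()
```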


eafn commented Jun 26, 2023

In addition, linear evaluation on 2 GPUs and on 4 GPUs gives the same result.


GANPerf commented Jun 26, 2023

> Hi! I trained the model using your code, but I could not reproduce the results reported in the paper. Could you please release more training and test details?

> Hi, I appreciate your interest. Could you kindly provide a detailed explanation regarding which specific results cannot be reproduced and on which dataset? This information will greatly assist in addressing the issue effectively. Thank you.

> Specifically, when I tried to reproduce the baseline results (an IN1K-pretrained ResNet-50 only, without any SSL pretraining) using main_lincls.py, I found that they differ significantly from those reported in the paper. With the given parameter settings (lr=30, bs=256, wd=0, schedule=[60,80]), the linear evaluation accuracy only reached 63.85, while the kNN result (with topk=200, t=0.1) was 46.37. This is a substantial discrepancy from the results reported in the paper. Despite trying various learning rates and batch sizes (e.g., lr=0.1, bs=256), the highest linear evaluation result I have achieved so far is only 65.68.

I appreciate your interest. Could you please let me know which checkpoint you have been using? How about the acc1 performance on StanfordCars & Aircraft? To attain the highest acc1 accuracy, it is advisable to consider selecting the checkpoint with the best retrieval performance, rather than relying solely on the last epoch during the pretraining process. For your reference, I have provided the checkpoint at the following link: Checkpoint Link


eafn commented Jun 26, 2023

Thank you for your reply. I only tested on CUB, and only the performance of the baseline (IN1K-pretrained only), not your model. I don't quite understand why the baseline retrieval rank-1 results are so different: my result is 46, while the paper reports 10.65. Does retrieval here refer to kNN classification (topk=200)?


eafn commented Jun 26, 2023

Besides, the linear evaluation result is the highest accuracy reached during linear-classifier training, while the kNN result requires no training.


GANPerf commented Jun 26, 2023

> Besides, the linear evaluation result is the highest accuracy reached during linear-classifier training, while the kNN result requires no training.

It appears that there is a distinction between our retrieval rank-1 metric and the kNN approach. In our rank-1 metric, the anchor is assigned the label of the single feature with the smallest distance to it, so there is no majority voting as in kNN.
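
Roughly, the idea is the following. This is a minimal sketch for illustration only, not the exact evaluation code in this repository; it assumes the query and gallery are the same feature set, with the self-match excluded:

```python
import torch

def rank1_retrieval(feats, labels):
    """Rank-1 retrieval accuracy: each anchor is assigned the label of the
    single feature with the smallest distance to it (itself excluded);
    there is no voting over k neighbors as in kNN."""
    dist = torch.cdist(feats, feats)        # [N, N] pairwise distances
    dist.fill_diagonal_(float('inf'))       # exclude the anchor itself
    nn_idx = dist.argmin(dim=1)             # closest feature per anchor
    return 100.0 * (labels[nn_idx] == labels).float().mean().item()
```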


eafn commented Jun 26, 2023

Are you suggesting that the rank-1 metric is equivalent to kNN with topk set to 1? I tried that yesterday, but the result was 43.


eafn commented Jun 26, 2023

Could you please provide the code for the retrieval rank-1 metric? Thanks.


GANPerf commented Jun 26, 2023

> Could you please provide the code for the retrieval rank-1 metric? Thanks.

already provided


eafn commented Jun 26, 2023

Thanks, I will try.


eafn commented Jun 26, 2023

Alright, I have figured out what's going on. There is a fatal flaw in your retrieval evaluation.

Specifically, when testing the baseline with the provided validation code, the rank-1 accuracy is 48. However, if the nn.functional.normalize call is removed from the model's inference function, the accuracy drops to 10.5, which is consistent with the results reported in your paper.

In summary, you did not normalize the features when evaluating the baseline, but did normalize them when evaluating your method, which is the root cause of the problem. Retrieval and kNN evaluation based on cosine similarity require feature normalization. You can verify what I've said; if it's correct, I hope you will update your experimental results on arXiv, as this will seriously affect future research efforts.
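
To make the comparison concrete, here is a minimal self-contained sketch of the check I ran; feats and labels stand for whatever the validation script extracts from the frozen IN1K-pretrained ResNet-50, and the names are mine rather than the repository's:

```python
import torch
import torch.nn.functional as F

def rank1(feats, labels, normalize):
    """Rank-1 retrieval accuracy, with or without L2-normalizing the features."""
    if normalize:
        feats = F.normalize(feats, dim=1)   # ranking then matches cosine similarity
    dist = torch.cdist(feats, feats)        # [N, N] pairwise Euclidean distances
    dist.fill_diagonal_(float('inf'))       # exclude the anchor itself
    nn_idx = dist.argmin(dim=1)             # single nearest neighbor per anchor
    return 100.0 * (labels[nn_idx] == labels).float().mean().item()

# On CUB with the IN1K-pretrained ResNet-50 baseline, my runs give roughly:
#   rank1(feats, labels, normalize=True)  -> ~48
#   rank1(feats, labels, normalize=False) -> ~10.5  (matches the paper's baseline)
```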

Thank you for taking the time to consider my request, and I look forward to hearing from you soon.


GANPerf commented Jun 27, 2023

> Alright, I have figured out what's going on. There is a fatal flaw in your retrieval evaluation.

> Specifically, when testing the baseline with the provided validation code, the rank-1 accuracy is 48. However, if the nn.functional.normalize call is removed from the model's inference function, the accuracy drops to 10.5, which is consistent with the results reported in your paper.

> In summary, you did not normalize the features when evaluating the baseline, but did normalize them when evaluating your method, which is the root cause of the problem. Retrieval and kNN evaluation based on cosine similarity require feature normalization. You can verify what I've said; if it's correct, I hope you will update your experimental results on arXiv, as this will seriously affect future research efforts.

> Thank you for taking the time to consider my request, and I look forward to hearing from you soon.

Thank you, Yifan. I will thoroughly investigate this matter. If that is the case, we will update our arXiv paper.
