The results in the paper could not be reproduced #6

Hi! I trained the model using your code, but the results in the paper could not be reproduced. Please release more training and test details.

Comments
Hi, I appreciate your interest. Could you kindly explain which specific results cannot be reproduced, and on which dataset? This information will greatly help in addressing the issue. Thank you.
Specifically, I tried to reproduce the baseline results (IN1K-pretrained ResNet-50 only, without any SSL pretraining) using main_lincls.py, and found that they differ significantly from those reported in the paper. With the given parameter settings (lr=30, bs=256, wd=0, schedule=[60,80]), the linear evaluation result only reaches 63.85, while the kNN result (topk=200, t=0.1) is 46.37. This discrepancy from the paper is substantial. Even after adjusting the learning rate and batch size (e.g., lr=0.1, bs=256), the highest linear evaluation result I have achieved so far is only 65.68.
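For context, here is a minimal sketch of the weighted-kNN evaluation that the topk=200, t=0.1 settings suggest (the protocol commonly used for SSL feature evaluation); the function and tensor names are illustrative, not taken from this repo:

```python
import torch
import torch.nn.functional as F

def knn_eval(train_feats, train_labels, test_feats, test_labels,
             k=200, t=0.1, num_classes=200):
    """Weighted kNN on L2-normalized features, so the dot product
    equals cosine similarity. num_classes=200 matches CUB."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sim = test_feats @ train_feats.t()        # (n_test, n_train) cosine similarities
    sim_k, idx_k = sim.topk(k, dim=1)         # k nearest training samples per test sample
    weights = (sim_k / t).exp()               # temperature-scaled similarity weights

    votes = F.one_hot(train_labels[idx_k], num_classes).float()  # (n_test, k, C)
    scores = (votes * weights.unsqueeze(-1)).sum(dim=1)          # weighted class votes
    pred = scores.argmax(dim=1)
    return 100.0 * (pred == test_labels).float().mean().item()
```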
In addition, linear evaluation on both 2 and 4 GPUs gives the same result.
I appreciate your interest. Could you please let me know which checkpoint you have been using? And what is the acc1 performance on StanfordCars and Aircraft? To attain the highest acc1, it is advisable to select the checkpoint with the best retrieval performance during pretraining, rather than relying solely on the last epoch. For your reference, I have provided a checkpoint at the following link: Checkpoint Link
Thank you for your reply. I only tested on CUB, and only the baseline (IN1K-pretrained only), not your model. I don't quite understand why the baseline retrieval rank-1 results differ so much: my result is 46, while the paper reports 10.65. Does "Retrieval" refer to kNN classification (topk=200)?
Also, my linear evaluation result is the highest accuracy reached during linear-classifier training, and the kNN result involves no training.
There is indeed a distinction between our rank-1 retrieval metric and the kNN approach. In our rank-1 metric, the anchor and the feature with the smallest distance to it are taken to share the same label, so no majority vote over neighbors is involved.
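For concreteness, that rank-1 metric could be sketched like this (names are illustrative; whether the features are L2-normalized first turns out to matter, as discussed below):

```python
import torch
import torch.nn.functional as F

def rank1_accuracy(feats, labels, normalize=True):
    """Rank-1 retrieval: a sample counts as a hit if its single nearest
    neighbor (excluding itself) has the same label; no voting over k."""
    if normalize:
        feats = F.normalize(feats, dim=1)    # dot product becomes cosine similarity
    sim = feats @ feats.t()
    sim.fill_diagonal_(float('-inf'))        # never retrieve the anchor itself
    nearest = sim.argmax(dim=1)
    return 100.0 * (labels[nearest] == labels).float().mean().item()
```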
Are you saying the rank-1 metric is equivalent to kNN with topk=1? I tried that yesterday, but the result is 43.
Could you please provide the rank-1 retrieval metric code? Thanks.
Already provided.
Thanks, I will try.
Alright, I have figured out what's going on. There is a fatal flaw in your retrieval experiment. When I test the baseline with the provided validation code, the rank-1 accuracy is 48. However, if the nn.functional.normalize call is removed from the model's inference function, the accuracy drops to 10.5, which is consistent with the result reported in your paper. In summary, you did not normalize the features when validating the baseline, but did normalize them when validating your own method, and that is the root cause of the problem: retrieval and kNN based on cosine similarity require feature normalization. You can verify this yourself. If it is correct, I hope you will update your experimental results on arXiv, as they will seriously affect future research. Thank you for taking the time to consider my request; I look forward to hearing from you soon.
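To make the flaw concrete, here is a hypothetical feature-extraction sketch (not the repo's actual code) in which the `normalize` flag is the single difference between the two outcomes described above, roughly 48 rank-1 with normalization versus 10.5 without:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(model, loader, device, normalize=True):
    """Hypothetical inference loop. Downstream retrieval/kNN uses dot
    products, which are cosine similarities only if the features are
    L2-normalized here; skipping this step lets feature magnitudes
    distort the nearest-neighbor ranking."""
    model.eval()
    feats, labels = [], []
    for images, targets in loader:
        f = model(images.to(device))
        if normalize:
            f = F.normalize(f, dim=1)   # the step at issue: nn.functional.normalize
        feats.append(f.cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)
```

Evaluating both the baseline and the proposed method with the same `normalize=True` setting is what makes the comparison fair.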
Thank you, Yifan. I will thoroughly investigate this matter. If that is the case, we will update our arXiv paper.