Questions about the reported HM number (60.5) of CGE on UT-Zap50K? #4

Open
HeimingX opened this issue May 20, 2021 · 4 comments

@HeimingX

Hi,

Thanks for the impressive paper and the high-quality open-source codebase.

I ran an experiment on Zappos with this codebase (without any edits) and found a big gap between the test-set HM of CGE (47.11) and the one reported in the paper (60.5). The following is the eval log on the test set:

[image: eval log on the test set]

Furthermore, in this log, although the test AUC is close to the reported one (33.5), the best unseen accuracy (66.05) also falls well short of the reported one (71.5).

I am a little confused by these gaps. Could you please give some explanation? Thanks a lot.

@mancinimassimiliano mancinimassimiliano assigned ferjad and unassigned ferjad May 20, 2021
@ferjad
Collaborator

ferjad commented May 20, 2021

@HeimingX thanks for the interest in our work and the kind words.
Regarding metrics, AUC is the most stable one, since it measures the relative change between seen and unseen accuracy. The best harmonic mean depends on the bias each run has between seen and unseen accuracy. The discretization of the difference used to find bias points can make the results vary and hence affect the numbers. We found this to be most prevalent on UT-Zappos. We reported the best number we got across multiple runs, consistent with older works.
If you plan on working on this topic, I would recommend sticking with either MIT-States or C-GQA and extending to UT-Zappos afterward. Some UT-Zappos states, like Leather vs. Synthetic leather, are material differences that are not always visible as visual transformations; we discussed this in our paper.
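
To make the bias dependence concrete, here is a minimal sketch of a bias sweep as it is commonly done in CZSL evaluation: a scalar bias is added to the scores of unseen pairs, seen and unseen accuracy are recorded at each bias, AUC is the area under that curve, and the best HM is the maximum harmonic mean over the evaluated bias points. This is not the evaluator from this repository; the function names `seen_unseen_curve` and `auc_and_best_hm`, the bias grids, and the toy scores are all made up for illustration.

```python
import numpy as np

def seen_unseen_curve(scores, labels, seen_mask, biases):
    """Seen/unseen accuracy at each calibration bias (illustrative only).

    scores:    (N, P) compatibility scores of N samples against P pairs
    labels:    (N,) index of the true pair for each sample
    seen_mask: (P,) bool, True where the pair appeared in training
    """
    sample_is_seen = seen_mask[labels]
    seen_acc, unseen_acc = [], []
    for b in biases:
        # Shift only the unseen-pair scores by the bias b.
        shifted = scores + b * (~seen_mask)
        correct = shifted.argmax(axis=1) == labels
        seen_acc.append(correct[sample_is_seen].mean())
        unseen_acc.append(correct[~sample_is_seen].mean())
    return np.array(seen_acc), np.array(unseen_acc)

def auc_and_best_hm(seen_acc, unseen_acc):
    # Trapezoidal area under the seen-vs-unseen accuracy curve.
    auc = abs(np.sum(np.diff(unseen_acc) * (seen_acc[1:] + seen_acc[:-1]) / 2))
    # Harmonic mean at every bias point; guard against 0/0.
    total = seen_acc + unseen_acc
    hm = np.where(total > 0, 2 * seen_acc * unseen_acc / np.maximum(total, 1e-12), 0.0)
    return float(auc), float(hm.max())

# Toy data (made up): pair 0 is seen, pair 1 is unseen.
scores = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 0.8], [1.0, 0.2]])
labels = np.array([0, 0, 1, 1])
seen_mask = np.array([True, False])

fine = auc_and_best_hm(*seen_unseen_curve(scores, labels, seen_mask, [-1, 0, 0.3, 0.6, 0.9, 2.0]))
coarse = auc_and_best_hm(*seen_unseen_curve(scores, labels, seen_mask, [-1, 0, 0.6, 2.0]))
print(fine, coarse)
```

On this toy data the same scores give a best HM of about 0.67 on the finer grid and 0.5 on the coarser one, so the choice of bias points alone can move the reported HM.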

@HeimingX
Author

HeimingX commented May 21, 2021

Hi Ferjad,

Thanks for the timely response and detailed explanation.

However, I still have some concerns:

  1. I have actually run CGE on UT-Zappos multiple times (it reaches its performance peak quickly and seems to overfit after that), but none of these runs achieves such a high HM (most results are around 50). Would it be possible for you to publish the model checkpoint that achieved the reported number? And since the results tend to have a large variance, mean and std seem to be necessary metrics.

  2. Regarding the dataset suggestions, MIT-States is argued to have label noise (both in the CGE paper and in [1]), and the newly proposed C-GQA dataset seems to have an incomplete training set (as reported in this issue):

     1. In the training data, 1371 out of 6963 train pairs have no data. 40 out of 453 attributes have no data, and 196 out of 870 objects have no data.
     2. In the validation set, 133 pairs come from the training pairs without training data.
     3. In the test set, 134 pairs come from the training pairs without training data.

     In my view, it would be hard for a model to generalize to a new composition without having seen the corresponding attribute or object before, and this seems to be beyond the scope of current research on CZSL. I am not sure whether I have understood it correctly. Could you please explain further? Thanks a lot.
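
Counts like the ones above can be checked with a few set operations. The following is a rough sketch under assumed inputs: `train_pairs` stands for the (attribute, object) pairs listed in a split file and `train_samples` for the per-image (attribute, object) labels. Both names, the helper functions, and the toy data are hypothetical and do not reflect the actual C-GQA file format; attributes and objects are counted only among those listed in `train_pairs`.

```python
def coverage_gaps(train_pairs, train_samples):
    """Find listed training pairs, attributes and objects with no images.

    train_pairs:   iterable of (attr, obj) pairs listed for training
    train_samples: iterable of (attr, obj) labels, one per training image
    """
    train_pairs = set(train_pairs)
    observed = set(train_samples)
    listed_attrs = {a for a, _ in train_pairs}
    listed_objs = {o for _, o in train_pairs}
    return {
        "pairs_without_data": train_pairs - observed,
        "attrs_without_data": listed_attrs - {a for a, _ in observed},
        "objs_without_data": listed_objs - {o for _, o in observed},
    }

def affected_eval_pairs(eval_pairs, empty_pairs):
    # Evaluation pairs that are listed as training pairs but have no training images.
    return set(eval_pairs) & set(empty_pairs)

# Toy data (made up for illustration).
train_pairs = [("red", "car"), ("old", "car"), ("wet", "dog")]
train_samples = [("red", "car"), ("red", "car"), ("old", "car")]

gaps = coverage_gaps(train_pairs, train_samples)
affected = affected_eval_pairs([("wet", "dog"), ("red", "dog")], gaps["pairs_without_data"])
print(gaps, affected)
```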

[1]: A causal view of compositional zero-shot recognition

@zhaohengz

Hi Heiming,

If you don't mind me asking, were you able to replicate the HM results on Zappos? I tried several times and also got ~50.

Thanks,
Zhaoheng

@HeimingX
Author

Hi,

Not yet.

I just noticed that the reported test-set HM is somehow close to the result on the validation set, while the reported and reproduced test-set AUC are comparable. This is quite weird. I am not sure whether an editing error happened in the paper.

| Zappos | AUC | HM | Seen | Unseen |
|---|---|---|---|---|
| test set (reported) | 33.5 | 60.5 | 64.5 | 71.5 |
| val set (reported) | 43.2 | - | - | - |
| test set (reproduced) | 33.4 | 48.3 | 61.9 | 67.9 |
| val set (reproduced) | 41.4 | 56.9 | 63.6 | 71.2 |

Look forward to the author's feedback~
