
Fail to Reproduce the dev score of GENRE Document Retrieval #90

Open · ma787639046 opened this issue Sep 28, 2022 · 7 comments

ma787639046 commented Sep 28, 2022

Hi, I have been trying to reproduce the page-level document retrieval results of GENRE, but my dev scores are significantly lower than those of the model you provided (fairseq_wikipage_retrieval).

Here are the details of my training setup:

Training set: Following Section 4.1 of the paper, I mix and shuffle the BLINK and 8 KILT jsonl training files into a single file, then use the scripts convert_kilt_to_fairseq.py and preprocess_fairseq.sh to process it.
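For reference, the mixing step looks roughly like this (a minimal sketch; the file names are placeholders for the actual BLINK and KILT dumps, and the seed is my own choice):

```python
import random

# Placeholder names for the BLINK train file and the 8 KILT train files.
files = ["blink-train-kilt.jsonl"] + [f"kilt-task{i}-train.jsonl" for i in range(8)]

lines = []
for path in files:
    with open(path, encoding="utf-8") as f:
        lines.extend(f.readlines())  # one JSON object per line

random.seed(42)  # arbitrary seed, my own choice
random.shuffle(lines)

with open("mixed-train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines)
```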

Dev set: I simply concatenate all 11 KILT dev jsonl files into a single jsonl file, then process it with the same pipeline described above.

Training hyperparameters: I use the script train.sh for training, and I set keep-best-checkpoints=1 to save the checkpoint that performs best on the dev set.

Following Appendix A.3, I note that 128 GPUs were used with max-tokens=1024 and update-freq=1. I train on 16 GPUs, so I set max-tokens=8192 to keep the total number of tokens per update at 128 × 1024.
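As a quick sanity check on that scaling (a trivial sketch; the variable names are mine):

```python
# Paper (Appendix A.3): 128 GPUs, max-tokens=1024, update-freq=1.
# Mine: 16 GPUs, max-tokens=8192, update-freq=1.
paper_tokens_per_update = 128 * 1024 * 1
my_tokens_per_update = 16 * 8192 * 1
assert paper_tokens_per_update == my_tokens_per_update == 131072
```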

Here are the dev results on KILT for the provided model fairseq_wikipage_retrieval and for my reproduced model:

| model_name | fever | aidayago2 | wn | cweb | trex | structured_zeroshot | nq | hotpotqa | triviaqa | eli5 | wow |
|---|---|---|---|---|---|---|---|---|---|---|---|
| genre_fairseq_wikipage_retrieval (provided) | 0.846907 | 0.927467 | 0.876914 | 0.705305 | 0.7968 | 0.948443 | 0.642228 | 0.518214 | 0.71114 | 0.134705 | 0.563196 |
| My reproduced model | 0.826217 | 0.927048 | 0.874264 | 0.713342 | 0.716 | 0.864125 | 0.576665 | 0.399821 | 0.701064 | 0.13935 | 0.570727 |

The results for T-REx, structured_zeroshot, NQ, and HotpotQA are lower than those of the provided model. Could you help me figure out what might be going wrong?

Thank you very much.
nicola-decao (Contributor) commented

Training seems correct. Are you using constrained search during the evaluation?

nicola-decao (Contributor) commented

Also, with BLINK you do not need to use convert_kilt_to_fairseq.

ma787639046 (Author) commented

Thanks for your quick response.

  1. I use constrained search during evaluation (see the decoding sketch after this list). The trie is the downloaded kilt_titles_trie_dict.pkl, and I run evaluate_kilt_dataset.py with beam=10, max_len_a=384, max_len_b=15.

  2. I obtained the BLINK training set in JSON Lines format from blink-train-kilt.jsonl. This file appears to be structured the same way as the other KILT datasets, so I simply concatenated blink-train-kilt.jsonl with the other 8 KILT train jsonl files mentioned in the paper into a single file, then shuffled it with Python's random.shuffle().
    I concatenated all 11 KILT dev jsonl files into one file as the development set.
    Then I used the scripts convert_kilt_to_fairseq.py and preprocess_fairseq.sh to process the files above.
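For completeness, the constrained decoding itself looks roughly like this, following the repository README (a minimal sketch; the model path is a placeholder, and I assume evaluate_kilt_dataset.py wires up the same prefix constraint internally):

```python
import pickle

from genre.fairseq_model import GENRE
from genre.trie import Trie

# Prefix trie over all Wikipedia titles, used to constrain generation.
with open("kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# Placeholder path to the fine-tuned checkpoint.
model = GENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()

# Constrained beam search: at each step, only tokens that keep the
# output a valid prefix of some Wikipedia title are allowed.
predictions = model.sample(
    ["Who composed the music for the film Jaws?"],
    beam=10,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
```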

Am I doing these steps the right way?

Thanks again!

ma787639046 (Author) commented

@nicola-decao

nicola-decao (Contributor) commented

Yes, you are doing it correctly then. I am not sure what is going wrong. Are you sure you are training with the same batch size and number of steps as reported in the paper?

ma787639046 (Author) commented

Yes. I reran the whole finetuning process on 8 V100 GPUs (torch 1.6.0 + CUDA 10.1). I directly used the training script train.sh, setting max-tokens per GPU to 1024, update-freq to 128, and max-update to 200000, which should match the hyperparameters reported in Appendix A.3. I get the following results.

| model_name | FEV | AY2 | WnWi | WnCw | T-REx | zsRE | NQ | HoPo | TQA | ELI5 | WoW | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| genre_fairseq_wikipage_retrieval (provided) | 0.84681 | 0.92747 | 0.87691 | 0.7053 | 0.7968 | 0.94844 | 0.64258 | 0.51821 | 0.71114 | 0.1347 | 0.5632 | 0.69742 |
| My reproduced model | 0.84203 | 0.92559 | 0.88516 | 0.71048 | 0.7288 | 0.86198 | 0.60416 | 0.40625 | 0.69938 | 0.13603 | 0.58481 | 0.67133 |

T-REx, zsRE, NQ, HoPo, and TQA are still lower than expected.

nicola-decao (Contributor) commented Oct 18, 2022

That is weird, but I do not know how to help. I do not work at Facebook/Meta anymore, so I cannot re-run the experiments or check the original code that was launched. Note: I ran on more GPUs.
