Failed to Reproduce the dev score of GENRE Document Retrieval #90
Comments
Training seems correct. Are you using constrained search during the evaluation?
Also, with BLINK you do not need to use convert_kilt_to_fairseq.
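For reference, evaluation with constrained beam search over the Wikipedia-title prefix tree looks roughly like this (a minimal sketch following the examples in the GENRE README; the pickle and checkpoint paths are placeholders you would swap for your own):

```python
import pickle

from genre.trie import Trie
from genre.fairseq_model import GENRE

# Load the prefix tree built over the KILT Wikipedia page titles
# (placeholder path for wherever you keep the trie pickle).
with open("data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# Load your fine-tuned checkpoint (or the released fairseq_wikipage_retrieval).
model = GENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()

# Constrained generation: at every decoding step only tokens that
# continue a valid Wikipedia title are allowed.
model.sample(
    sentences=["Einstein was a German physicist."],
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
```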
Thanks for your quick response.
Am I doing these the right way? Thanks again!
Yes, you are doing it correctly then. I am not sure what is going wrong. Are you sure you are training with the same batch size and number of steps as reported in the paper?
Yes, I reran the whole fine-tuning process on 8 V100 GPUs (torch 1.6.0 + CUDA 10.1). I directly used the training script train.sh with max-tokens per GPU set to 1024, update-freq set to 128, and max-update set to 200000, which should be the same hyperparameters reported in Appendix A.3. I get the following results.
T-REx, zsRE, NQ, HoPo, and TQA are still lower than expected.
That is weird, but I do not know how to help. I do not work at Facebook/Meta anymore, so I cannot re-run experiments or check the original code that was launched. Note: I ran on more GPUs.
Hi, I was trying to reproduce the page-level document retrieval results of GENRE, but my dev scores are significantly lower than those of the model you provided (fairseq_wikipage_retrieval).
Here are my details for training:
Training set: Following Section 4.1 of the paper, I mixed and shuffled the BLINK and 8 KILT jsonl training files into a single file (sketched below), and used the scripts convert_kilt_to_fairseq.py and preprocess_fairseq.sh to process it.
Dev set: I simply concatenated all 11 KILT dev jsonl files into one jsonl file, then applied the same processing as above.
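For concreteness, the mixing step was essentially the following (a rough sketch only; the file paths are placeholders, not the repo's actual layout):

```python
import glob
import random

random.seed(0)

# Placeholder directory holding the BLINK training file(s) plus the
# 8 KILT training jsonl files.
lines = []
for path in glob.glob("train_jsonl/*.jsonl"):
    with open(path) as f:
        lines.extend(f.readlines())

# Shuffle and write out a single mixed training file, which is then
# passed through convert_kilt_to_fairseq.py and preprocess_fairseq.sh.
random.shuffle(lines)
with open("mixed_train.jsonl", "w") as f:
    f.writelines(lines)
```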
Training hyperparameters: I used the script train.sh for training and set keep-best-checkpoints=1 to save the checkpoint that performs best on the dev set.
Following Appendix A.3, I noticed that 128 GPUs were used with max-tokens=1024 and update-freq=1. I trained on 16 GPUs, so I set max-tokens=8192 to keep the total max tokens per update at 128*1024.
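To be explicit, the tokens-per-update accounting I assumed is:

```python
# Effective tokens per update = num_GPUs * max_tokens * update_freq.
paper_setup = 128 * 1024 * 1   # 128 GPUs, max-tokens=1024, update-freq=1
my_setup = 16 * 8192 * 1       # 16 GPUs,  max-tokens=8192, update-freq=1 (assumed)
assert paper_setup == my_setup == 131072
```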
Here are the dev results on KILT for the model you provided (fairseq_wikipage_retrieval) and for my reproduced model.
The results for T-REx, structured_zeroshot, NQ, and HotpotQA are lower than those of the model you provided. Could you help me figure out what might be going wrong?
Thank you very much.
@nicola-decao