Recently, I used GPT for generation on the DART dataset. However, I found that the test set may differ from the one used in other works: I can only obtain 5,097 test samples, while the GEM website says the test set contains 12,552. The data provided by (Li et al., 2021) (https://github.com/XiangLi1999/PrefixTuning) also has 12,552 samples, but it does not include gold references.
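For reference, this is roughly how I counted the test samples. The file name is assumed from the Yale-LILY/dart repo layout (data/v1.1.1/); adjust it to your local copy:

```python
import json

# Load the official DART test split; the file name below is assumed
# from the Yale-LILY/dart repo (data/v1.1.1/dart-v1.1.1-full-test.json).
with open("dart-v1.1.1-full-test.json", encoding="utf-8") as f:
    test_set = json.load(f)

# The file is a JSON list with one entry per example.
print(len(test_set))  # prints 5,097 for me, not 12,552
```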
Using the official evaluation scripts with this test set, I obtain about 37-38 BLEU, which is much lower than the 46-47 BLEU reported by (Li et al., 2021) and other works (e.g., the leaderboard at https://github.com/Yale-LILY/dart). So I am confused about which one is correct.
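For completeness, this is the kind of corpus-level BLEU computation I ran. It is a minimal sketch using sacrebleu rather than the repo's own evaluation scripts, and the file names are placeholders, but the score comes out in the same 37-38 range:

```python
import sacrebleu

# Placeholder file names: one hypothesis and one gold reference per line.
hyps = [line.strip() for line in open("hyps.txt", encoding="utf-8")]
refs = [line.strip() for line in open("refs.txt", encoding="utf-8")]

# corpus_bleu takes a list of reference streams; DART has multiple
# references per example, so additional streams can be appended here.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")  # roughly 37-38 on the 5,097-sample split
```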
Could you please answer these questions if possible? I would appreciate it.
Reference
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv preprint arXiv:2101.00190.