Pegasus: replication and distillation results #6844
Comments
If anyone wants to help, evaluate on a dataset where the third column is not filled in.
Helper function for run_eval:
gen_test_hub_summ () {
    # need to add --fp16 and --bs = whatever
    model=$1
    DATA_DIR=$2
    echo "$DATA_DIR"
    save_dir=$3
    mkdir -p "$save_dir"
    # drop the three positional arguments; everything left is passed through to run_eval.py
    shift 3
    python run_eval.py "$model" "$DATA_DIR/test.source" "$save_dir/test_gens.txt" --reference_path "$DATA_DIR/test.target" --score_path "$save_dir/test_rouge.json" --task summarization "$@"
}
Then, roughly:
Leave the results, as well as any observations about truncation of the produced summaries, as a comment in this issue!
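One rough way to collect truncation observations is to count generated summaries that do not end in sentence-final punctuation. The sketch below is only a heuristic under that assumption; the function names `looks_truncated` and `truncation_rate` are hypothetical and are not part of run_eval.py.

```python
def looks_truncated(summary: str, terminators: str = ".!?\"'") -> bool:
    """Heuristic: a non-empty summary that does not end in
    sentence-final punctuation was probably cut off mid-sentence."""
    stripped = summary.strip()
    return bool(stripped) and stripped[-1] not in terminators


def truncation_rate(summaries) -> float:
    """Fraction of non-empty summaries flagged by looks_truncated."""
    non_empty = [s for s in summaries if s.strip()]
    if not non_empty:
        return 0.0
    return sum(looks_truncated(s) for s in non_empty) / len(non_empty)
```

To apply it, read the lines of the generations file written by the helper above (test_gens.txt in the save directory) and pass them to `truncation_rate`. Summaries that legitimately end without punctuation will be miscounted, so treat the number as a rough indicator.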
CNN Dailymail
One possible reason for replication issue is that our beam search logic differs from the original, causing 16% of the summaries to be truncated. Finetuning with our finetuning code and
XSUM
Teacher metrics (I don't remember batch size):
I intend to post a writeup on distillation techniques at some point before Oct 15!
Re: replication, best download strategy may be to start with
CNN update:
For CNNDM, I can get this score with
Script I run:
I notice the first R1 output from the transformer is 43.xx something, but I recalculate ROUGE (to get the scores above) as follows:
Hope this helps with the fix.
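A gap like the one described above (43.xx versus 44.xx) can come from differences in how the ROUGE scorer tokenizes and preprocesses text. As an illustration only, here is a minimal sketch of ROUGE-1 F1 as clipped unigram overlap. It is not pyrouge and not the script used in this thread: the tokenizer is a simplifying assumption (lowercased alphanumeric runs, no stemming), so its numbers will not match either of the scores reported here.

```python
import re
from collections import Counter


def tokenize(text: str):
    """Assumed tokenizer: lowercase, keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1 computed from clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Changing `tokenize` (for example, adding stemming or keeping punctuation) changes the score for the same pair of texts, which is why two scorers can disagree on identical generations.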
Would you be willing to share a few lines of
Thanks!
Hi, for inference, I use the same set from
Then, I got R1 = 43.xx (as the
To get the R1 = 44.xx, I separately calculate ROUGE (pyrouge) with:
Updated. Both now have
We fixed our 44.31/21.53/41.15 for
Updated the table in the Issue description with most recent results after the
Hi guys: is there code for pretraining the model on my own data?
Thank you for reproducing these results!
I have tried around 10 sets of hyperparameters and only achieved slightly worse results (ROUGE-1 ~ 43.9 for CNN/DailyMail). These are the options I used in my experiments:
I don't know how to continue. Can someone tell me what I am doing wrong?
See if this comment above helps
Hi @patil-suraj, yes, I did notice that. These are my results:
Are my results reasonable (representing the expected outcome)? :-)
Hi, can you please tell me a bit about what you want to achieve, and which pre-trained Pegasus model you are currently using? It seems to me that you are not only running inference but also fine-tuning the Pegasus model (based on your hyperparameters)?
Yes, here is my experiment description:
I guess
Have you tried setting the hyperparameters of the original implementation? You can check them here. The primary hyperparameters will be these:
You probably want to follow their hyperparameters for inference as well (e.g. beam size).
Hi @fajri91, I have tried your suggestion and achieved the following results after 210k steps:
Hi Sam, I have a quick question about obtaining the Gigaword results with the "google/pegasus-gigaword" checkpoint provided by Google. I currently use a very simple setup: I load "google/pegasus-gigaword" and follow the default huggingface code to generate Gigaword summaries. For the dataset, I load 'gigaword' directly from the datasets library without pre-processing, and I use the rouge_score library to compute ROUGE. However, my results on the 1951 Gigaword test samples deviate by almost 10 ROUGE points (rouge1/rouge2/rougeL: 28/12/25 vs 39.79/20.56/36.80). Could you share the setup you used to reproduce your experiment? Thanks in advance!
Replication
link
mixed & stochastic column of this table
Final Update (2020-10-16)
Mission accomplished thanks to the work of @patil-suraj and @stas00!
The above table now shows that our results are close enough.
We suspect differences are due to treatment of the <n> character that pegasus generates and slightly different beam search implementations.
Link to Spreadsheet with timing data
Questions about specific results should be asked on the forums/separate issues with @stas00, @patil-suraj, and @sshleifer tagged.
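On the <n> treatment mentioned in the final update: one plausible way to normalize generations before scoring is to turn each <n> marker into a real newline, so that a scorer that treats each line as a sentence sees the sentence boundaries. The sketch below assumes that convention; the function names are hypothetical and this is not necessarily what the fix in this issue did.

```python
def split_on_marker(summary: str, marker: str = "<n>"):
    """Split a generated summary on the marker, dropping empty pieces."""
    return [piece.strip() for piece in summary.split(marker) if piece.strip()]


def to_newline_separated(summary: str, marker: str = "<n>") -> str:
    """Replace each marker with a newline so every sentence is on its own line."""
    return "\n".join(split_on_marker(summary, marker))
```

If the references use a different sentence separator than the generations, applying the same normalization to both sides keeps the comparison consistent.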