
Pegasus: replication and distillation results #6844

Closed · sshleifer opened this issue Aug 31, 2020 · 21 comments

Labels: Help wanted (Extra attention is needed, help appreciated), Replication
@sshleifer (Contributor) commented Aug 31, 2020

Replication

link

mixed & stochastic column of this table

| dataset | Authors | This Repo | best bart | best bart name |
|---|---|---|---|---|
| xsum | 47.60/24.83/39.64 | 46.87/24.46/39.15 | 22.32/37.39 | distilbart-xsum-12-6 |
| cnn_dailymail | 44.16/21.56/41.30 | see comment | 21.26/30.59 | distilbart-cnn-12-6 |
| newsroom | 45.07/33.39/41.28 | 41.03/29.83/36.96 | | |
| multi_news | 47.65/18.75/24.95 | 47.58/19.0/24.77 | | |
| gigaword | 39.65/20.47/36.76 | 39.79/20.56/36.80 | | |
| wikihow | 46.39/22.12/38.41 * | 46.85/23.64/28.73 | | |
| reddit_tifu | 27.99/9.81/22.94 | 32.75/11.68/24.97 | | |
| big_patent | 52.29/33.08/41.66 * | | | |
| arxiv | 44.21/16.95/25.67 | 44.83/17.34/25.60 | | |
| pubmed | 45.97/20.15/28.25 | 45.40/19.42/26.93 | | |
| aeslc | 37.68/21.25/36.51 | 37.09/21.40/35.93 | | |
| billsum | 59.67/41.58/47.59 | 56.18/39.94/45.39 | | |
  • (*) (authors' footnote) the numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data

Final Update (2020-10-16)

Mission accomplished, thanks to the work of @patil-suraj and @stas00!

The above table now shows that our results are close enough.
We suspect the remaining differences are due to the treatment of the <n> character that pegasus generates and to slightly different beam search implementations.

Link to Spreadsheet with timing data

Questions about specific results should be asked on the forums/separate issues with @stas00, @patil-suraj, and @sshleifer tagged.

sshleifer changed the title from "google/pegasus-cnn_dailymail replication" to "Pegasus: replication results" on Aug 31, 2020
sshleifer self-assigned this on Aug 31, 2020
@sshleifer (Contributor, Author) commented Aug 31, 2020

If anyone wants to help, evaluate on a dataset where the third column is not filled in.
Steps:
First, download the data with the nlp package and save it to disk in the format described in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
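For illustration, a rough sketch of that download step, assuming the nlp package (since renamed to datasets) and the one-example-per-line layout used by download_wmt.py; the dataset name and field names below are just examples:

```python
# Hypothetical helper: dump a datasets/nlp split into the test.source /
# test.target layout that run_eval.py expects (one example per line).
import os
from datasets import load_dataset  # the "nlp" package was later renamed to "datasets"

def save_split(dataset_name, split, out_dir, src_key, tgt_key):
    ds = load_dataset(dataset_name, split=split)
    os.makedirs(out_dir, exist_ok=True)
    with open(f"{out_dir}/{split}.source", "w") as src, open(f"{out_dir}/{split}.target", "w") as tgt:
        for ex in ds:
            # strip embedded newlines so source and target lines stay aligned
            src.write(ex[src_key].replace("\n", " ") + "\n")
            tgt.write(ex[tgt_key].replace("\n", " ") + "\n")

# example: save_split("xsum", "test", "xsum", "document", "summary")
```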

Helper function for run_eval

gen_test_hub_summ () {
    # usage: gen_test_hub_summ <model> <data_dir> <save_dir> [extra run_eval.py flags]
    # e.g. pass --fp16 and --bs <batch size> as extra flags
    model=$1
    DATA_DIR=$2
    save_dir=$3
    echo "$DATA_DIR"
    mkdir -p "$save_dir"
    shift 3  # everything after the first three args is forwarded to run_eval.py
    python run_eval.py "$model" "$DATA_DIR/test.source" "$save_dir/test_gens.txt" \
        --reference_path "$DATA_DIR/test.target" \
        --score_path "$save_dir/test_rouge.json" \
        --task summarization "$@"
}

Then, roughly:

cd examples/seq2seq
gen_test_hub_summ google/pegasus-{dataset} dataset  {dataset}_results --bs 4

Leave the results, as well as any observations about truncation in the produced summaries, as a comment in this issue!

sshleifer added the Help wanted (Extra attention is needed, help appreciated) and Replication labels on Aug 31, 2020
@sshleifer (Contributor, Author) commented Sep 5, 2020

CNN Dailymail

One possible reason for the replication issue is that our beam search logic differs from the original, causing 16% of the summaries to be truncated.

Finetuning with our finetuning code and --max_target_length=142 partially fixes this issue:

  • Can get a distilled version (16-4): 43.23/21.29/31.3 at 0.436 s/sample (released at sshleifer/dpx-cnn-16-4)
  • Can finetune the 16-16 pegasus-cnn checkpoint to get 44.13/21.37/30.94 at 1.4 s/sample (0.2 ROUGE-2 behind published) (sshleifer/pegasus-cnn-ft-v2)
  • The original google/pegasus-cnn_dailymail scored 20.73 ROUGE-2.
  • For both of these finetuned models, >99.8% of generations end in punctuation (a quick check for this is sketched below).
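For reference, a quick way to measure that kind of truncation from a generations file (a hypothetical check, not the script used for the numbers above), assuming one summary per line in test_gens.txt:

```python
# Hypothetical check: count generated summaries that do not end in punctuation,
# a rough proxy for beam-search truncation.
import sys

def truncation_rate(path, punctuation=(".", "!", "?", '"', "'")):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    truncated = [line for line in lines if not line.endswith(punctuation)]
    return len(truncated) / max(len(lines), 1)

if __name__ == "__main__":
    # e.g. python check_truncation.py cnn_results/test_gens.txt
    print(f"{truncation_rate(sys.argv[1]):.1%} of summaries look truncated")
```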

XSUM

sshleifer/distill-pegasus-xsum-16-4

{"rouge1": 44.942, "rouge2": 23.0412, "rougeL": 37.8579,
 "n_obs": 11333, "seconds_per_sample": 0.1972, "batch_size": 16}

Teacher metrics (I don't remember batch size):

{"rouge1": 46.8773, "rouge2": 24.46, "rougeL": 39.1507, 
"n_obs": 11328,  "seconds_per_sample": 0.3308}

sshleifer changed the title from "Pegasus: replication results" to "Pegasus: replication and distillation results" on Sep 14, 2020
@sshleifer (Contributor, Author):

I intend to post a writeup on distillation techniques at some point before Oct 15!

@sshleifer (Contributor, Author):

Re: replication, the best download strategy may be to start with
https://github.com/google-research/pegasus/blob/master/pegasus/data/public_datasets_test.py and modify it.

@sshleifer (Contributor, Author) commented Sep 22, 2020

CNN update:

  • I believe we have a preprocessing issue. Ported models generate the <n> token at the beginning of sentences, whereas ours do not. The original pegasus code replaces the newline symbol with <n>. PegasusTokenizer should probably do this: PegasusTokenizer: Newline symbol #7327
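For illustration, a minimal sketch of that <n> round-trip done by hand around a ported checkpoint (this is not what PegasusTokenizer did at the time, which is exactly the issue; whether the replacement belongs on inputs, targets, or both is part of what #7327 discusses):

```python
# Hand-rolled <n> handling around a ported Pegasus checkpoint; the two
# replace() calls are the point, the rest is standard transformers usage.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name)

article = "First paragraph of the article.\nSecond paragraph of the article."
# mirror the original preprocessing: real newlines become the <n> symbol
batch = tokenizer(article.replace("\n", "<n>"), truncation=True, return_tensors="pt")
summary_ids = model.generate(**batch)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
# map <n> back to real newlines before scoring with ROUGE
print(summary.replace("<n>", "\n"))
```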

@fajri91 (Contributor) commented Sep 26, 2020

For CNNDM, I can get this score with the google/pegasus-cnn_dailymail model.

ROUGE-1:
rouge_1_f_score: 0.4436 with confidence interval (0.4413, 0.4459)
rouge_1_recall: 0.4825 with confidence interval (0.4797, 0.4853)
rouge_1_precision: 0.4368 with confidence interval (0.4339, 0.4395)

ROUGE-2:
rouge_2_f_score: 0.2145 with confidence interval (0.2120, 0.2170)
rouge_2_recall: 0.2323 with confidence interval (0.2297, 0.2350)
rouge_2_precision: 0.2124 with confidence interval (0.2097, 0.2150)

ROUGE-l:
rouge_l_f_score: 0.4141 with confidence interval (0.4118, 0.4165)
rouge_l_recall: 0.4501 with confidence interval (0.4474, 0.4530)
rouge_l_precision: 0.4079 with confidence interval (0.4051, 0.4106)

Script I run:

./run_eval.py google/pegasus-cnn_dailymail /home/ffajri/Data/huggingface/cnn_dm/test.source pred_cnndm_pegasus.txt \
    --reference_path /home/ffajri/Data/huggingface/cnn_dm/test.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 512 \
    --max_target_length 128 \
    --bs 4

I notice the initial R1 output from transformers is 43.xx, but I recalculate ROUGE (to get the scores above) as follows:

  1. First, I replace <n> with \n in the decoded results (as you said above).
  2. I don't use the gold summary provided by huggingface because its sentences are not separated by the newline character, and I think it's necessary to separate sentences in the gold summary. So I use the original gold test set from See et al., 2017 to compute ROUGE.
  3. I lowercase all decoded and gold summaries (but I am not sure whether this really affects the ROUGE score).
  4. I calculate ROUGE with the pyrouge code (not the ROUGE implementation in transformers); a rough sketch of these steps is given below.
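For reference, a rough sketch of those post-processing steps. It uses the rouge_score package as a stand-in for pyrouge (so the exact numbers will differ slightly) and assumes the predictions and references are already loaded as parallel lists of strings:

```python
# Hypothetical re-scoring: map <n> back to real newlines, lowercase, and score
# with rougeLsum, which (like pyrouge's ROUGE-L) respects newline-separated sentences.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

def prepare(text):
    # steps 1 and 3 above: restore sentence boundaries and lowercase
    return text.replace("<n>", "\n").lower()

predictions = ["first predicted sentence .<n>second predicted sentence ."]  # from pred_cnndm_pegasus.txt
references = ["first reference sentence .\nsecond reference sentence ."]    # from the See et al. (2017) test set

scores = [scorer.score(prepare(ref), prepare(pred)) for ref, pred in zip(references, predictions)]
for key in ("rouge1", "rouge2", "rougeLsum"):
    print(key, round(100 * sum(s[key].fmeasure for s in scores) / len(scores), 2))
```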

Hope this helps with the fix.

@sshleifer (Contributor, Author) commented Sep 26, 2020

Would you be willing to share a few lines of

cnn_dm/test.source, pred_cnndm_pegasus.txt, and cnn_dm/test.target

Thanks!

@fajri91 (Contributor) commented Sep 26, 2020

Hi, for inference, I use the same set from huggingface

test.source
Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." ............

test.target
Marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports . Journalists at Bild and Paris Match are "very confident" the video clip is real, an editor says . Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says .

pred_cnndm_pegasus.txt (Result)
"A person who has such a video needs to immediately give it to the investigators," prosecutor says .<n>"It is a very disturbing scene," editor-in-chief of Bild online tells "Erin Burnett: Outfront"

Then, I got R1 = 43.xx (as the ./run_eval.py output)

To get the R1 = 44.xx, I separately calculate ROUGE (pyrouge) with:

test.target from See et al., 2017
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .\njournalists at bild and paris match are '' very confident '' the video clip is real , an editor says .\nandreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .

updated pred_cnndm_pegasus.txt
"a person who has such a video needs to immediately give it to the investigators," prosecutor says .\n"it is a very disturbing scene," editor-in-chief of bild online tells "erin burnett: outfront"

Both now have \n which I think is necessary for calculating ROUGE.

@sshleifer (Contributor, Author) commented Oct 2, 2020

We fixed our calculate_rouge_score to address the \n issue and now we are getting

44.31/21.53/41.15 for sshleifer/pegasus-cnn-ft-v2! Thanks for the help!
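For intuition (this is not the calculate_rouge code itself), the \n handling matters because summary-level ROUGE-L ("rougeLsum" in the rouge_score package) scores newline-separated sentences individually, while plain "rougeL" treats the whole summary as one token sequence:

```python
# Toy example: the same sentences in a different order are heavily penalized
# by rougeL but not by rougeLsum.
from rouge_score import rouge_scorer

pred = "the cat sat on the mat .\nthe dog barked at the mailman ."
ref = "the dog barked at the mailman .\nthe cat sat on the mat ."

scores = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"]).score(ref, pred)
print(scores["rougeL"].fmeasure)     # low: one long LCS over the whole summary
print(scores["rougeLsum"].fmeasure)  # high: each sentence matches a reference sentence
```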

@sshleifer (Contributor, Author):

Updated the table in the issue description with the most recent results after the calculate_rouge fix.
Moving forward, questions about specific results should be asked on the forums or in a separate issue with @stas00, @patil-suraj, and @sshleifer tagged.

@paulrich1234:

Hi guys:

Is there code to pretrain the model on my own data?
Thank you

@yaozhaogoogle commented Feb 19, 2021

Thank you for reproducing these results!
Regarding the treatment of <n>: the newline char "\n" in the input text is replaced by "<n>", and vice versa for the output.

@nguyentthong commented Mar 2, 2021

I have tried around 10 sets of hyperparameters and only achieved slightly worse results (ROUGE-1 ≈ 43.9 for CNN/DailyMail). These are the options in my experiments:

  • Optimizer: Adafactor <-> AdamW (a sketch of both setups follows this list)
  • Learning rate: 5e-4 <-> 1e-4
  • Batch size: 4
  • Gradient accumulation steps: 1 <-> 8 <-> 64
  • Accelerator: dp <-> ddp
  • Epochs: 20 - 80 (after around 10 epochs it starts to overfit; val loss increases)
  • Datasets: both old and new versions (the old version doesn't contain <n> in the target summary)
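For context, a sketch of how the two optimizers in the first bullet are typically instantiated with transformers; the learning rates are just the values listed above, not a recommendation:

```python
# Sketch of the two optimizer setups being compared above (Adafactor vs. AdamW);
# not necessarily the configuration that reproduces the paper.
from torch.optim import AdamW
from transformers.optimization import Adafactor

def make_optimizer(model, use_adafactor=True, lr=1e-4):
    if use_adafactor:
        # fixed learning rate instead of Adafactor's default relative-step schedule
        return Adafactor(model.parameters(), lr=lr,
                         scale_parameter=False, relative_step=False, warmup_init=False)
    return AdamW(model.parameters(), lr=lr)
```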

I don't know what to try next; can someone tell me what my problem might be?

@patil-suraj (Contributor):

Hi @thongnguyen050999

See if this comment above helps
#6844 (comment)

@nguyentthong commented Mar 2, 2021

Hi @patil-suraj,

Yes, I did notice that; these are my results:

  • Sentence ends with "<n>": ROUGE-1: 45.94, ROUGE-L: 32.24
  • Sentence ends with "\n": ROUGE-1: 43.96, ROUGE-L: 40.87

@nguyentthong:

Are my results reasonable (representing the expected outcome)? :-)

@fajri91 (Contributor) commented Mar 3, 2021

Are my results reasonable (representing the expected outcome)? :-)

Hi, can you please tell me a bit about what you want to achieve, and which pre-trained Pegasus model you are currently using? It seems you are not only doing inference but also some fine-tuning of the Pegasus model (based on your hyperparameters)?

@nguyentthong:

Yes, here is my experiment description:

  • Goal: I want to reproduce the results from the Pegasus paper (in the future I might add some changes on top of the baseline 🧑‍🎓 ) by fine-tuning from the pretrained checkpoint
  • Pretrained model I use: google/pegasus-large

@fajri91 (Contributor) commented Mar 3, 2021

I guess google/pegasus-large in huggingface is a Mixed & Stochastic model, for which we expect 44.16/21.56/41.30 (your current score is slightly below that).

Have you tried the hyperparameters of the original implementation? You can check them here.

The primary hyperparameters are:
"max_input_len": 1024, --> (longer text)
"max_output_len": 128,
"train_steps": 210000,
"learning_rate": 0.001,
"batch_size": 8,

You probably want to follow their hyperparameters for inference as well (e.g. beam size, etc.).
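For orientation only, here is a rough mapping of those settings onto the transformers Seq2SeqTrainingArguments API (the original implementation is TensorFlow, and argument names such as generation_max_length come from later transformers versions, so treat this as a sketch rather than the setup used in this thread):

```python
# Hypothetical mapping of the original Pegasus hyperparameters onto a
# transformers fine-tuning configuration.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus_cnn_dailymail_ft",
    per_device_train_batch_size=8,   # "batch_size": 8
    learning_rate=1e-3,              # "learning_rate": 0.001
    max_steps=210_000,               # "train_steps": 210000
    predict_with_generate=True,
    generation_max_length=128,       # "max_output_len": 128
)
# "max_input_len": 1024 is handled at tokenization time, e.g.
# tokenizer(examples["article"], max_length=1024, truncation=True)
```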

@nguyentthong commented Mar 7, 2021

Hi @fajri91, I have tried your suggestion and achieved the following results after 210k steps:

  • Huggingface version:
      • ROUGE-1 = 43.2011
      • ROUGE-L = 39.99
  • Google version (I ran their default code without modifications):
      • ROUGE-1 = 43.01
      • ROUGE-L = 39.92

@xu1998hz commented May 3, 2022

[Quotes the Replication table and Final Update from the issue description above.]
Hi Sam, I have a quick question about obtaining the results for Gigaword using the "google/pegasus-gigaword" checkpoint provided by Google. Currently I follow a very simple setup: I use "google/pegasus-gigaword" with the default huggingface generation code to produce Gigaword summaries, I load 'gigaword' directly from the datasets library without pre-processing, and I use the rouge_score library to compute ROUGE. However, my results on the 1951 Gigaword test samples deviate by almost 10 ROUGE points (rouge1/rouge2/rougeL: 28/12/25 vs. 39.79/20.56/36.80). Could you share the setup you used to reproduce your result?

Thanks in advance!
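For concreteness, a minimal sketch of the kind of setup described above (assumed, since the exact script was not shared; everything is left at the checkpoint's default generation settings, which may explain part of the gap):

```python
# Hypothetical reconstruction of the Gigaword evaluation described above.
import torch
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "google/pegasus-gigaword"
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name).to(device)

test = load_dataset("gigaword", split="test")  # 1951 examples
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

scores = []
for ex in test:
    batch = tokenizer(ex["document"], truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(**batch)
    pred = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
    scores.append(scorer.score(ex["summary"], pred))

for key in ("rouge1", "rouge2", "rougeL"):
    print(key, round(100 * sum(s[key].fmeasure for s in scores) / len(scores), 2))
```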
