
Pegasus: replication and distillation results #6844

Closed · sshleifer opened this issue Aug 31, 2020 · 21 comments

Labels: Help wanted (Extra attention is needed, help appreciated), Replication
@sshleifer (Contributor) commented Aug 31, 2020

Replication

link

mixed & stochastic column of this table

| dataset | Authors | This Repo | best bart | best bart name |
|---|---|---|---|---|
| xsum | 47.60/24.83/39.64 | 46.87/24.46/39.15 | 22.32/37.39 | distilbart-xsum-12-6 |
| cnn_dailymail | 44.16/21.56/41.30 | see comment | 21.26/30.59 | distilbart-cnn-12-6 |
| newsroom | 45.07/33.39/41.28 | 41.03/29.83/36.96 | | |
| multi_news | 47.65/18.75/24.95 | 47.58/19.0/24.77 | | |
| gigaword | 39.65/20.47/36.76 | 39.79/20.56/36.80 | | |
| wikihow | 46.39/22.12/38.41 * | 46.85/23.64/28.73 | | |
| reddit_tifu | 27.99/9.81/22.94 | 32.75/11.68/24.97 | | |
| big_patent | 52.29/33.08/41.66 * | | | |
| arxiv | 44.21/16.95/25.67 | 44.83/17.34/25.60 | | |
| pubmed | 45.97/20.15/28.25 | 45.40/19.42/26.93 | | |
| aeslc | 37.68/21.25/36.51 | 37.09/21.40/35.93 | | |
| billsum | 59.67/41.58/47.59 | 56.18/39.94/45.39 | | |
  • (*) (authors' footnote) the numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data

Final Update (2020-10-16)

Mission accomplished, thanks to the work of @patil-suraj and @stas00!

The above table now shows that our results are close enough.
We suspect the remaining differences are due to the treatment of the <n> character that pegasus generates and to slightly different beam search implementations.

Link to Spreadsheet with timing data

Questions about specific results should be asked on the forums/separate issues with @stas00, @patil-suraj, and @sshleifer tagged.

sshleifer changed the title from "google/pegasus-cnn_dailymail replication" to "Pegasus: replication results" on Aug 31, 2020
sshleifer self-assigned this on Aug 31, 2020
@sshleifer (Contributor, Author) commented Aug 31, 2020

If anyone wants to help, evaluate on a dataset where the third column is not filled in.
Steps:
First, download the data with the nlp package and save it to disk in the format described in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
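For illustration, a rough sketch of that download step, assuming the nlp package (since renamed to datasets) and the one-example-per-line layout used by download_wmt.py; the dataset name and field names below are just examples:

```python
# Hypothetical helper: dump a datasets/nlp split into the test.source /
# test.target layout that run_eval.py expects (one example per line).
import os
from datasets import load_dataset  # the "nlp" package was later renamed to "datasets"

def save_split(dataset_name, split, out_dir, src_key, tgt_key):
    ds = load_dataset(dataset_name, split=split)
    os.makedirs(out_dir, exist_ok=True)
    with open(f"{out_dir}/{split}.source", "w") as src, open(f"{out_dir}/{split}.target", "w") as tgt:
        for ex in ds:
            # strip embedded newlines so source and target lines stay aligned
            src.write(ex[src_key].replace("\n", " ") + "\n")
            tgt.write(ex[tgt_key].replace("\n", " ") + "\n")

# example: save_split("xsum", "test", "xsum", "document", "summary")
```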

Helper function for run_eval

gen_test_hub_summ () {
    # usage: gen_test_hub_summ <model> <data_dir> <save_dir> [extra run_eval.py flags]
    # e.g. pass --fp16 and --bs <batch size> as extra flags
    model=$1
    DATA_DIR=$2
    save_dir=$3
    echo "$DATA_DIR"
    mkdir -p "$save_dir"
    shift 3  # everything after the first three args is forwarded to run_eval.py
    python run_eval.py "$model" "$DATA_DIR/test.source" "$save_dir/test_gens.txt" \
        --reference_path "$DATA_DIR/test.target" \
        --score_path "$save_dir/test_rouge.json" \
        --task summarization "$@"
}

Then, roughly:

cd examples/seq2seq
gen_test_hub_summ google/pegasus-{dataset} dataset  {dataset}_results --bs 4

Leave the results, as well as any observations about truncation in the produced summaries, as a comment in this issue!

sshleifer added the Help wanted (Extra attention is needed, help appreciated) and Replication labels on Aug 31, 2020
@sshleifer (Contributor, Author) commented Sep 5, 2020

CNN Dailymail

One possible reason for the replication issue is that our beam search logic differs from the original, causing 16% of the summaries to be truncated.

Finetuning with our finetuning code and --max_target_length=142 partially fixes this issue:

  • Can get a distilled version (16-4): 43.23/21.29/31.3 at 0.436 s/sample (released at sshleifer/dpx-cnn-16-4)
  • Can finetune the 16-16 pegasus-cnn checkpoint to get 44.13/21.37/30.94 at 1.4 s/sample (0.2 ROUGE-2 behind published) (sshleifer/pegasus-cnn-ft-v2)
  • The original google/pegasus-cnn_dailymail scored 20.73 ROUGE-2.
  • For both of these finetuned models, >99.8% of generations end in punctuation (a quick check for this is sketched below).
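For reference, a quick way to measure that kind of truncation from a generations file (a hypothetical check, not the script used for the numbers above), assuming one summary per line in test_gens.txt:

```python
# Hypothetical check: count generated summaries that do not end in punctuation,
# a rough proxy for beam-search truncation.
import sys

def truncation_rate(path, punctuation=(".", "!", "?", '"', "'")):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    truncated = [line for line in lines if not line.endswith(punctuation)]
    return len(truncated) / max(len(lines), 1)

if __name__ == "__main__":
    # e.g. python check_truncation.py cnn_results/test_gens.txt
    print(f"{truncation_rate(sys.argv[1]):.1%} of summaries look truncated")
```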

XSUM

sshleifer/distill-pegasus-xsum-16-4

{"rouge1": 44.942, "rouge2": 23.0412, "rougeL": 37.8579,
 "n_obs": 11333, "seconds_per_sample": 0.1972, "batch_size": 16}

Teacher metrics (I don't remember batch size):

{"rouge1": 46.8773, "rouge2": 24.46, "rougeL": 39.1507, 
"n_obs": 11328,  "seconds_per_sample": 0.3308}

sshleifer changed the title from "Pegasus: replication results" to "Pegasus: replication and distillation results" on Sep 14, 2020
@sshleifer (Contributor, Author):

I intend to post a writeup on distillation techniques at some point before Oct 15!

@sshleifer (Contributor, Author):

Re: replication, the best download strategy may be to start with
https://github.com/google-research/pegasus/blob/master/pegasus/data/public_datasets_test.py and modify it.

@sshleifer (Contributor, Author) commented Sep 22, 2020

CNN update:

  • I believe we have a preprocessing issue. Ported models generate the <n> token at the beginning of sentences, whereas ours do not. The original pegasus code replaces the newline symbol with <n>. PegasusTokenizer should probably do this: PegasusTokenizer: Newline symbol #7327
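For illustration, a minimal sketch of that <n> round-trip done by hand around a ported checkpoint (this is not what PegasusTokenizer did at the time, which is exactly the issue; whether the replacement belongs on inputs, targets, or both is part of what #7327 discusses):

```python
# Hand-rolled <n> handling around a ported Pegasus checkpoint; the two
# replace() calls are the point, the rest is standard transformers usage.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name)

article = "First paragraph of the article.\nSecond paragraph of the article."
# mirror the original preprocessing: real newlines become the <n> symbol
batch = tokenizer(article.replace("\n", "<n>"), truncation=True, return_tensors="pt")
summary_ids = model.generate(**batch)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
# map <n> back to real newlines before scoring with ROUGE
print(summary.replace("<n>", "\n"))
```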

@fajri91 (Contributor) commented Sep 26, 2020

For CNNDM, I can get this score with the google/pegasus-cnn_dailymail model.

ROUGE-1:
rouge_1_f_score: 0.4436 with confidence interval (0.4413, 0.4459)
rouge_1_recall: 0.4825 with confidence interval (0.4797, 0.4853)
rouge_1_precision: 0.4368 with confidence interval (0.4339, 0.4395)

ROUGE-2:
rouge_2_f_score: 0.2145 with confidence interval (0.2120, 0.2170)
rouge_2_recall: 0.2323 with confidence interval (0.2297, 0.2350)
rouge_2_precision: 0.2124 with confidence interval (0.2097, 0.2150)

ROUGE-l:
rouge_l_f_score: 0.4141 with confidence interval (0.4118, 0.4165)
rouge_l_recall: 0.4501 with confidence interval (0.4474, 0.4530)
rouge_l_precision: 0.4079 with confidence interval (0.4051, 0.4106)

Script I run:

./run_eval.py google/pegasus-cnn_dailymail /home/ffajri/Data/huggingface/cnn_dm/test.source pred_cnndm_pegasus.txt \
    --reference_path /home/ffajri/Data/huggingface/cnn_dm/test.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 512 \
    --max_target_length 128 \
    --bs 4

I notice the initial R1 output from transformers is 43.xx, but I recalculate ROUGE (to get the scores above) as follows:

  1. First, I replace <n> with \n in the decoded results (as you said above).
  2. I don't use the gold summary provided by huggingface because its sentences are not separated by the newline character, and I think it's necessary to separate sentences in the gold summary. So I use the original gold test set from See et al., 2017 to compute ROUGE.
  3. I lowercase all decoded and gold summaries (but I am not sure whether this really affects the ROUGE score).
  4. I calculate ROUGE with the pyrouge code (not the ROUGE implementation in transformers); a rough sketch of these steps is given below.
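For reference, a rough sketch of those post-processing steps. It uses the rouge_score package as a stand-in for pyrouge (so the exact numbers will differ slightly) and assumes the predictions and references are already loaded as parallel lists of strings:

```python
# Hypothetical re-scoring: map <n> back to real newlines, lowercase, and score
# with rougeLsum, which (like pyrouge's ROUGE-L) respects newline-separated sentences.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

def prepare(text):
    # steps 1 and 3 above: restore sentence boundaries and lowercase
    return text.replace("<n>", "\n").lower()

predictions = ["first predicted sentence .<n>second predicted sentence ."]  # from pred_cnndm_pegasus.txt
references = ["first reference sentence .\nsecond reference sentence ."]    # from the See et al. (2017) test set

scores = [scorer.score(prepare(ref), prepare(pred)) for ref, pred in zip(references, predictions)]
for key in ("rouge1", "rouge2", "rougeLsum"):
    print(key, round(100 * sum(s[key].fmeasure for s in scores) / len(scores), 2))
```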

Hope this helps with the fix.

@sshleifer (Contributor, Author) commented Sep 26, 2020

Would you be willing to share a few lines of

cnn_dm/test.source, pred_cnndm_pegasus.txt, and cnn_dm/test.target

Thanks!

@fajri91 (Contributor) commented Sep 26, 2020

Hi, for inference, I use the same set from huggingface

test.source
Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." ............

test.target
Marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports . Journalists at Bild and Paris Match are "very confident" the video clip is real, an editor says . Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says .

pred_cnndm_pegasus.txt (Result)
"A person who has such a video needs to immediately give it to the investigators," prosecutor says .<n>"It is a very disturbing scene," editor-in-chief of Bild online tells "Erin Burnett: Outfront"

Then, I got R1 = 43.xx (as the ./run_eval.py output)

To get the R1 = 44.xx, I separately calculate ROUGE (pyrouge) with:

test.target from See et al., 2017
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .\njournalists at bild and paris match are '' very confident '' the video clip is real , an editor says .\nandreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .

updated pred_cnndm_pegasus.txt
"a person who has such a video needs to immediately give it to the investigators," prosecutor says .\n"it is a very disturbing scene," editor-in-chief of bild online tells "erin burnett: outfront"

Both now have \n which I think is necessary for calculating ROUGE.

@sshleifer (Contributor, Author) commented Oct 2, 2020

We fixed our calculate_rouge_score to address the \n issue and now we are getting

44.31/21.53/41.15 for sshleifer/pegasus-cnn-ft-v2! Thanks for the help!
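For intuition (this is not the calculate_rouge code itself), the \n handling matters because summary-level ROUGE-L ("rougeLsum" in the rouge_score package) scores newline-separated sentences individually, while plain "rougeL" treats the whole summary as one token sequence:

```python
# Toy example: the same sentences in a different order are heavily penalized
# by rougeL but not by rougeLsum.
from rouge_score import rouge_scorer

pred = "the cat sat on the mat .\nthe dog barked at the mailman ."
ref = "the dog barked at the mailman .\nthe cat sat on the mat ."

scores = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"]).score(ref, pred)
print(scores["rougeL"].fmeasure)     # low: one long LCS over the whole summary
print(scores["rougeLsum"].fmeasure)  # high: each sentence matches a reference sentence
```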

@sshleifer (Contributor, Author):

Updated the table in the issue description with the most recent results after the calculate_rouge fix.
Moving forward, questions about specific results should be asked on the forums or in a separate issue with @stas00, @patil-suraj, and @sshleifer tagged.

@paulrich1234:

Hi guys:

Is there code to pretrain the model on my own data?
Thank you

@yaozhaogoogle commented Feb 19, 2021

Thank you for reproducing these results!
Regarding the treatment of <n>: the newline char "\n" in the input text is replaced by "<n>", and vice versa for the output.

@nguyentthong commented Mar 2, 2021

I have tried around 10 sets of hyperparameters and only achieved slightly worse results (ROUGE-1 ≈ 43.9 for CNN/DailyMail). These are the options in my experiments:

  • Optimizer: Adafactor <-> AdamW (a sketch of both setups follows this list)
  • Learning rate: 5e-4 <-> 1e-4
  • Batch size: 4
  • Gradient accumulation steps: 1 <-> 8 <-> 64
  • Accelerator: dp <-> ddp
  • Epochs: 20 - 80 (after around 10 epochs it starts to overfit; val loss increases)
  • Datasets: both old and new versions (the old version doesn't contain <n> in the target summary)
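For context, a sketch of how the two optimizers in the first bullet are typically instantiated with transformers; the learning rates are just the values listed above, not a recommendation:

```python
# Sketch of the two optimizer setups being compared above (Adafactor vs. AdamW);
# not necessarily the configuration that reproduces the paper.
from torch.optim import AdamW
from transformers.optimization import Adafactor

def make_optimizer(model, use_adafactor=True, lr=1e-4):
    if use_adafactor:
        # fixed learning rate instead of Adafactor's default relative-step schedule
        return Adafactor(model.parameters(), lr=lr,
                         scale_parameter=False, relative_step=False, warmup_init=False)
    return AdamW(model.parameters(), lr=lr)
```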

I don't know what to try next; can someone tell me what my problem might be?

@patil-suraj (Contributor):

Hi @thongnguyen050999

See if this comment above helps
#6844 (comment)

@nguyentthong commented Mar 2, 2021

Hi @patil-suraj,

Yes, I did notice that; these are my results:

  • Sentence ends with "<n>": ROUGE-1: 45.94, ROUGE-L: 32.24
  • Sentence ends with "\n": ROUGE-1: 43.96, ROUGE-L: 40.87

@nguyentthong:

Are my results reasonable (representing the expected outcome)? :-)

@fajri91 (Contributor) commented Mar 3, 2021

Are my results reasonable (representing the expected outcome)? :-)

Hi, can you please tell me a bit about what you want to achieve, and which pre-trained Pegasus model you are currently using? It seems you are not only doing inference but also some fine-tuning of the Pegasus model (based on your hyperparameters)?

@nguyentthong:

Yes, here is my experiment description:

  • Goal: I want to reproduce the results from the Pegasus paper (in the future I might add some changes on top of the baseline 🧑‍🎓 ) by fine-tuning from the pretrained checkpoint
  • Pretrained model I use: google/pegasus-large

@fajri91 (Contributor) commented Mar 3, 2021

I guess google/pegasus-large in huggingface is a Mixed & Stochastic model, for which we expect 44.16/21.56/41.30 (your current score is slightly below that).

Have you tried the hyperparameters of the original implementation? You can check them here.

The primary hyperparameters are:
"max_input_len": 1024, --> (longer text)
"max_output_len": 128,
"train_steps": 210000,
"learning_rate": 0.001,
"batch_size": 8,

You probably want to follow their hyperparameters for inference as well (e.g. beam size, etc.).
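For orientation only, here is a rough mapping of those settings onto the transformers Seq2SeqTrainingArguments API (the original implementation is TensorFlow, and argument names such as generation_max_length come from later transformers versions, so treat this as a sketch rather than the setup used in this thread):

```python
# Hypothetical mapping of the original Pegasus hyperparameters onto a
# transformers fine-tuning configuration.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus_cnn_dailymail_ft",
    per_device_train_batch_size=8,   # "batch_size": 8
    learning_rate=1e-3,              # "learning_rate": 0.001
    max_steps=210_000,               # "train_steps": 210000
    predict_with_generate=True,
    generation_max_length=128,       # "max_output_len": 128
)
# "max_input_len": 1024 is handled at tokenization time, e.g.
# tokenizer(examples["article"], max_length=1024, truncation=True)
```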

@nguyentthong commented Mar 7, 2021

Hi @fajri91, I have tried your suggestion and achieved the following results after 210k steps:

  • Huggingface version:
      • ROUGE-1 = 43.2011
      • ROUGE-L = 39.99
  • Google version (I ran their default code without modifications):
      • ROUGE-1 = 43.01
      • ROUGE-L = 39.92

@xu1998hz commented May 3, 2022

[Quotes the Replication table and Final Update from the issue description above.]
Hi Sam, I have a quick question about obtaining the results for Gigaword using the "google/pegasus-gigaword" checkpoint provided by Google. Currently I follow a very simple setup: I use "google/pegasus-gigaword" with the default huggingface generation code to produce Gigaword summaries, I load 'gigaword' directly from the datasets library without pre-processing, and I use the rouge_score library to compute ROUGE. However, my results on the 1951 Gigaword test samples deviate by almost 10 ROUGE points (rouge1/rouge2/rougeL: 28/12/25 vs. 39.79/20.56/36.80). Could you share the setup you used to reproduce your result?

Thanks in advance!
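For concreteness, a minimal sketch of the kind of setup described above (assumed, since the exact script was not shared; everything is left at the checkpoint's default generation settings, which may explain part of the gap):

```python
# Hypothetical reconstruction of the Gigaword evaluation described above.
import torch
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "google/pegasus-gigaword"
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name).to(device)

test = load_dataset("gigaword", split="test")  # 1951 examples
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

scores = []
for ex in test:
    batch = tokenizer(ex["document"], truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(**batch)
    pred = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
    scores.append(scorer.score(ex["summary"], pred))

for key in ("rouge1", "rouge2", "rougeL"):
    print(key, round(100 * sum(s[key].fmeasure for s in scores) / len(scores), 2))
```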
