[Benchmark] google/pegasus-wikihow #14804

Closed
vuiseng9 opened this issue Dec 16, 2021 · 2 comments

Comments


vuiseng9 commented Dec 16, 2021

🖥 Benchmarking transformers

Benchmark

Which part of transformers did you benchmark?
google/pegasus-wikihow

Set-up

What did you run your benchmarks on? Please include details, such as: CPU, GPU? If using multiple GPUs, which parallelization did you use?
The command below was run with transformers v4.13.0 on a single GPU. I tried to align the input parameters with the paper's setup.

python run_summarization.py \
    --model_name_or_path google/pegasus-wikihow \
    --dataset_name wikihow \
    --dataset_config all \
    --dataset_dir /data/dataset/wikihow \
    --max_source_length 512 \
    --max_target_length 256 \
    --do_eval \
    --per_device_eval_batch_size 8 \
    --predict_with_generate \
    --num_beams 8 \
    --overwrite_output_dir \
    --run_name $RUNID \
    --output_dir $OUTDIR
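
For context, the Hub's `wikihow` dataset script expects `wikihowAll.csv` to be downloaded manually, which is why a local dataset directory is passed. A minimal sketch of what the loading step amounts to (not part of the script itself, just for illustration):

```python
# Minimal sketch: load the same dataset configuration the command above points at.
# The Hub's "wikihow" builder does not download the data itself; it expects
# wikihowAll.csv to already be present in data_dir.
from datasets import load_dataset

raw_datasets = load_dataset(
    "wikihow",                          # --dataset_name wikihow
    "all",                              # --dataset_config all
    data_dir="/data/dataset/wikihow",   # directory containing wikihowAll.csv
)
print(raw_datasets)  # DatasetDict with train / validation / test splits
```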

Results

From the model card (ROUGE-1/ROUGE-2/ROUGE-L):

| dataset | C4 | HugeNews | Mixed & Stochastic |
| --- | --- | --- | --- |
| wikihow | 43.07/19.70/34.79 | 41.35/18.51/33.42 | 46.39/22.12/38.41 * |

According to issue #6844: 46.85/23.64/28.73

There was a footnote in that issue, so I wonder if any customization is needed:
(* authors' footnote) the numbers for the wikihow and big_patent datasets are not comparable because of a change in tokenization and data.

My results:
"eval_rouge1": 33.99,
"eval_rouge2": 13.0781,
"eval_rougeL": 26.5329,

@stas00, @patil-suraj, @sshleifer, I'd appreciate your pointers!


stas00 commented Dec 17, 2021

I'm not sure which subset of the dataset it was evaluated on, so it's hard to tell how to compare the scores, especially if the dataset has grown since it was evaluated a year ago.

So first I had to do the following, as the dataset contains missing fields:

diff --git a/examples/pytorch/summarization/run_summarization.py b/examples/pytorch/summarization/run_summarization.py
index 658c24114..60da701e6 100755
--- a/examples/pytorch/summarization/run_summarization.py
+++ b/examples/pytorch/summarization/run_summarization.py
@@ -436,8 +436,19 @@ def main():
         )

     def preprocess_function(examples):
-        inputs = examples[text_column]
-        targets = examples[summary_column]
+
+        # remove pairs where at least one record is None
+        inputs, targets = map(
+            list,
+            zip(
+                *(
+                    [examples[text_column][i], examples[summary_column][i]]
+                    for i in range(len(examples[text_column]))
+                    if examples[text_column][i] is not None and examples[summary_column][i] is not None
+                )
+            ),
+        )
+
         inputs = [prefix + inp for inp in inputs]
         model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

so that takes care of dropping incomplete records.
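
An equivalent approach, just as a sketch (assuming the raw `DatasetDict` loaded by the script and the `text`/`headline` columns used below), would be to drop the incomplete records with the `datasets` library's built-in filter before tokenization, rather than inside `preprocess_function`:

```python
# Sketch of an alternative to the diff above: drop rows with a missing article
# or summary at the datasets level, before preprocess_function is mapped.
# Assumes raw_datasets is the DatasetDict loaded by run_summarization.py and
# that the text/summary columns are "text" and "headline".
raw_datasets = raw_datasets.filter(
    lambda example: example["text"] is not None and example["headline"] is not None
)
```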

Now I can run the script normally after manually downloading the csv file, evaluating just the first 10 records:

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path google/pegasus-wikihow \
    --max_source_length 512 \
    --max_target_length 256 \
    --do_eval \
    --per_device_eval_batch_size 8 \
    --predict_with_generate \
    --num_beams 8 \
    --overwrite_output_dir \
    --output_dir output_dir \
    --validation_file data/wikihowAll.csv \
    --text_column text \
    --summary_column headline \
    --max_eval_samples 10

we get:

***** eval metrics *****
  eval_gen_len            =       62.2
  eval_loss               =      3.153
  eval_rouge1             =    53.0496
  eval_rouge2             =    30.0482
  eval_rougeL             =     45.322
  eval_rougeLsum          =    45.3855
  eval_runtime            = 0:00:07.14
  eval_samples            =         10
  eval_samples_per_second =      1.399
  eval_steps_per_second   =       0.14
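
For reference, `--max_eval_samples 10` just truncates the eval set to its first 10 examples, roughly as in this sketch (the csv is loaded here the way `--validation_file` loads it):

```python
# Rough sketch of what --max_eval_samples 10 amounts to inside the script:
# load the csv as the validation split, then keep only its first 10 examples.
from datasets import load_dataset

raw_datasets = load_dataset("csv", data_files={"validation": "data/wikihowAll.csv"})
eval_dataset = raw_datasets["validation"].select(range(10))
print(eval_dataset.column_names)  # should include "text" and "headline"
```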

So the score is good. But of course, we want more samples, and the right samples. The question is which eval samples the authors used: you have to use the same samples, and then you will be comparing apples to apples.

Until then the results don't lend themselves to a fair comparison, beyond telling us that the model does summarize, since the numbers are relatively high.

Does it make sense?

p.s. Alternatively, you could check out a revision of transformers from when the results were published, run the script on the same subset, and compare the current code with the year-old one. If you get the same score, then you know there was no regression in the code over the past year.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
