[Benchmark] google/pegasus-wikihow #14804

Closed
vuiseng9 opened this issue Dec 16, 2021 · 2 comments

Comments


vuiseng9 commented Dec 16, 2021

🖥 Benchmarking transformers

Benchmark

Which part of transformers did you benchmark?
google/pegasus-wikihow

Set-up

What did you run your benchmarks on? Please include details, such as: CPU, GPU? If using multiple GPUs, which parallelization did you use?
The command below was run with transformers v4.13.0 on a single GPU. I tried to align the input parameters with the paper's setup.

python run_summarization.py \
    --model_name_or_path google/pegasus-wikihow \
    --dataset_name wikihow \
    --dataset_config all \
    --dataset_dir /data/dataset/wikihow \
    --max_source_length 512 \
    --max_target_length 256 \
    --do_eval \
    --per_device_eval_batch_size 8 \
    --predict_with_generate \
    --num_beams 8 \
    --overwrite_output_dir \
    --run_name $RUNID \
    --output_dir $OUTDIR
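
For context, the Hub's `wikihow` dataset script expects `wikihowAll.csv` to be downloaded manually, which is why a local dataset directory is passed. A minimal sketch of what the loading step amounts to (not part of the script itself, just for illustration):

```python
# Minimal sketch: load the same dataset configuration the command above points at.
# The Hub's "wikihow" builder does not download the data itself; it expects
# wikihowAll.csv to already be present in data_dir.
from datasets import load_dataset

raw_datasets = load_dataset(
    "wikihow",                          # --dataset_name wikihow
    "all",                              # --dataset_config all
    data_dir="/data/dataset/wikihow",   # directory containing wikihowAll.csv
)
print(raw_datasets)  # DatasetDict with train / validation / test splits
```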

Results

From the model card (ROUGE-1/ROUGE-2/ROUGE-L):

| dataset | C4 | HugeNews | Mixed & Stochastic |
| --- | --- | --- | --- |
| wikihow | 43.07/19.70/34.79 | 41.35/18.51/33.42 | 46.39/22.12/38.41 * |

According to issue #6844: 46.85/23.64/28.73

There was a footnote in that issue, so I wonder if any customization is needed:
(* authors' footnote) the numbers for the wikihow and big_patent datasets are not comparable because of a change in tokenization and data.

My results:
"eval_rouge1": 33.99,
"eval_rouge2": 13.0781,
"eval_rougeL": 26.5329,

@stas00, @patil-suraj, @sshleifer, I'd appreciate your pointers!


stas00 commented Dec 17, 2021

I'm not sure which subset of the dataset it was evaluated on, so it's hard to tell how to compare the scores, especially if the dataset has grown since it was evaluated a year ago.

So first I had to do the following, as the dataset contains missing fields:

diff --git a/examples/pytorch/summarization/run_summarization.py b/examples/pytorch/summarization/run_summarization.py
index 658c24114..60da701e6 100755
--- a/examples/pytorch/summarization/run_summarization.py
+++ b/examples/pytorch/summarization/run_summarization.py
@@ -436,8 +436,19 @@ def main():
         )

     def preprocess_function(examples):
-        inputs = examples[text_column]
-        targets = examples[summary_column]
+
+        # remove pairs where at least one record is None
+        inputs, targets = map(
+            list,
+            zip(
+                *(
+                    [examples[text_column][i], examples[summary_column][i]]
+                    for i in range(len(examples[text_column]))
+                    if examples[text_column][i] is not None and examples[summary_column][i] is not None
+                )
+            ),
+        )
+
         inputs = [prefix + inp for inp in inputs]
         model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

so that takes care of dropping incomplete records.
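
An equivalent approach, just as a sketch (assuming the raw `DatasetDict` loaded by the script and the `text`/`headline` columns used below), would be to drop the incomplete records with the `datasets` library's built-in filter before tokenization, rather than inside `preprocess_function`:

```python
# Sketch of an alternative to the diff above: drop rows with a missing article
# or summary at the datasets level, before preprocess_function is mapped.
# Assumes raw_datasets is the DatasetDict loaded by run_summarization.py and
# that the text/summary columns are "text" and "headline".
raw_datasets = raw_datasets.filter(
    lambda example: example["text"] is not None and example["headline"] is not None
)
```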

Now I can run the script normally after manually downloading the csv file, evaluating just the first 10 records:

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path google/pegasus-wikihow \
    --max_source_length 512 \
    --max_target_length 256 \
    --do_eval \
    --per_device_eval_batch_size 8 \
    --predict_with_generate \
    --num_beams 8 \
    --overwrite_output_dir \
    --output_dir output_dir \
    --validation_file data/wikihowAll.csv \
    --text_column text \
    --summary_column headline \
    --max_eval_samples 10

we get:

***** eval metrics *****
  eval_gen_len            =       62.2
  eval_loss               =      3.153
  eval_rouge1             =    53.0496
  eval_rouge2             =    30.0482
  eval_rougeL             =     45.322
  eval_rougeLsum          =    45.3855
  eval_runtime            = 0:00:07.14
  eval_samples            =         10
  eval_samples_per_second =      1.399
  eval_steps_per_second   =       0.14
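
For reference, `--max_eval_samples 10` just truncates the eval set to its first 10 examples, roughly as in this sketch (the csv is loaded here the way `--validation_file` loads it):

```python
# Rough sketch of what --max_eval_samples 10 amounts to inside the script:
# load the csv as the validation split, then keep only its first 10 examples.
from datasets import load_dataset

raw_datasets = load_dataset("csv", data_files={"validation": "data/wikihowAll.csv"})
eval_dataset = raw_datasets["validation"].select(range(10))
print(eval_dataset.column_names)  # should include "text" and "headline"
```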

So the score is good. But of course, we want more samples, and the right samples. The question is which eval samples the authors used: you have to use the same samples, and then you will be comparing apples to apples.

Until then the results don't lend themselves to a fair comparison, beyond telling us that the model does summarize, since the numbers are relatively high.

Does it make sense?

p.s. Alternatively, you could check out a revision of transformers from when the results were published, run the script on the same subset, and compare the current code with the year-old one. If you get the same score, then you know there was no regression in the code over the past year.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
