[Benchmark] google/pegasus-wikihow #14804
Comments
I'm not sure which section of the dataset it was evaluated on, so it's hard to tell how to compare the scores, especially if the dataset has grown since it was evaluated a year ago. First I had to do the following, since the dataset contains records with missing fields:
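A minimal sketch of dropping the incomplete records, assuming the raw wikihowAll.csv is the input (the file names and column names here are assumptions, not the original snippet):

```python
# Minimal sketch (assumed, not the original snippet): drop WikiHow records with
# missing fields before building the evaluation file.
import pandas as pd

df = pd.read_csv("wikihowAll.csv")            # raw WikiHow dump: title, headline, text
df = df.dropna(subset=["headline", "text"])   # drop incomplete records
df.to_csv("wikihow_clean.csv", index=False)
```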
That takes care of dropping incomplete records. Now I can run the script normally, after manually downloading the csv file, with just the first 10 records:
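As an illustrative stand-in for the script invocation, here is a hedged Python sketch of scoring the first 10 cleaned records with google/pegasus-wikihow (the checkpoint name is real; the file name, column choices, and the plain ROUGE averaging are assumptions):

```python
# Hedged sketch: summarize the first 10 cleaned records with google/pegasus-wikihow
# and score them with ROUGE ("text" as the article, "headline" as the reference).
import pandas as pd
import torch
from rouge_score import rouge_scorer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "google/pegasus-wikihow"
tokenizer = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name).to(device)

df = pd.read_csv("wikihow_clean.csv").head(10)   # first 10 records only
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

scores = []
for text, ref in zip(df["text"], df["headline"]):
    batch = tokenizer(text, truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(**batch)        # checkpoint's default generation settings
    pred = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    scores.append(scorer.score(ref, pred))

for key in ("rouge1", "rouge2", "rougeL"):
    print(key, 100 * sum(s[key].fmeasure for s in scores) / len(scores))
```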
we get:
So the score is good. But of course we want more samples, and the right samples. The question is which eval samples the authors used: you have to use the same samples, and then you will be comparing apples to apples. Until then the results don't lend themselves to a fair comparison, other than telling us that the model does summarize, since the numbers are relatively high. Does that make sense? P.S. Alternatively, you could check out a revision of
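On the point about using the same samples, a hedged sketch of loading the canonical WikiHow splits through the datasets library, assuming the "all" config and a manually downloaded wikihowAll.csv (the data_dir path is a placeholder):

```python
# Hedged sketch: load the canonical WikiHow splits via `datasets`, so scoring is
# done on the same validation/test samples as everyone else. The "all" config
# requires the raw CSV to be downloaded manually; data_dir below is a placeholder.
from datasets import load_dataset

wikihow = load_dataset("wikihow", "all", data_dir="/path/to/wikihow_csvs")
print(wikihow)                 # shows the train / validation / test splits and their sizes
test_set = wikihow["test"]
print(test_set[0]["title"])
```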
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🖥 Benchmarking transformers
Benchmark
Which part of transformers did you benchmark? google/pegasus-wikihow
Set-up
What did you run your benchmarks on? Please include details, such as: CPU, GPU? If using multiple GPUs, which parallelization did you use?
The command below was run with transformers v4.13.0 on a single GPU. I tried to align the input parameters with the paper's setup.
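As a hedged companion to that, one way to check which generation defaults the checkpoint itself ships with (the model name is real; which attributes matter for matching the paper's setup is an assumption):

```python
# Hedged sketch: print the generation defaults bundled with google/pegasus-wikihow,
# to compare an eval run's parameters against the checkpoint's (and the paper's) setup.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/pegasus-wikihow")
for key in ("num_beams", "length_penalty", "max_length", "min_length"):
    print(key, getattr(config, key, None))
```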
Results
Model Card
According to issue #6844, the reported ROUGE-1/2/L scores are:
46.85/23.64/28.73
There was a footnote in the issue; I wonder if any customization is needed:
(*) (authors' footnote) the numbers for the wikihow and big_patent datasets are not comparable because of a change in tokenization and data
My results
"eval_rouge1": 33.99,
"eval_rouge2": 13.0781,
"eval_rougeL": 26.5329,
@stas00, @patil-suraj, @sshleifer, I'd appreciate your pointers!