Reddit dataset card additions #3781

Merged 3 commits on Feb 28, 2022

Conversation

anna-kay
Contributor

The proposed changes are based on the paper "TL;DR: Mining Reddit to Learn Automatic Summarization" and on https://zenodo.org/record/1043504#.YhaKHpbQC38
It is indeed a Reddit dataset, but the authors named it Webis-TLDR-17 (corpus), so perhaps the dataset name should be changed here as well.
The dataset is aimed at abstractive summarization.

@anna-kay anna-kay changed the title Proposed changes Reddit dataset card Feb 24, 2022
@anna-kay anna-kay changed the title Reddit dataset card Reddit dataset card additions Feb 24, 2022
Member

@lhoestq lhoestq left a comment

Cool, thanks for completing the dataset card.

I agree we should at least mention "Webis-TLDR-17" in the title of the dataset card and in the Dataset summary section, as well as in the pretty_name tag.

We don't have a redirection/alias mechanism yet, but once we do, we can definitely rename this dataset to webis_tldr_17, for example.

datasets/reddit/README.md.bak (outdated review thread, resolved)
Member

@lhoestq lhoestq left a comment

Thanks!

The CI is still failing because some tags are missing:

'annotations_creators', 'language_creators', 'licenses', 'multilinguality', 'size_categories', 'source_datasets', 'task_categories', and 'task_ids'

Let me know if you'd be willing to add the missing tags at the top of the readme in a subsequent PR (you can use this tool). These tags help users who are looking for specific datasets and improve discoverability.
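
As a rough illustration of the check described above, here is a minimal Python sketch that reports which of the listed tags are absent from the YAML block at the top of a README. It is not the project's actual CI check: the function names (`front_matter_keys`, `missing_tags`) and the example README are hypothetical, and the parsing is deliberately simplified to top-level keys using only the standard library.

```python
import re

# Tag names copied from the CI failure message quoted in the review comment.
REQUIRED_TAGS = [
    "annotations_creators", "language_creators", "licenses", "multilinguality",
    "size_categories", "source_datasets", "task_categories", "task_ids",
]


def front_matter_keys(readme_text):
    """Return the top-level keys of a leading '---' delimited YAML block.

    Simplified on purpose: only unindented 'key:' lines are recognised.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", line)
        if match:
            keys.add(match.group(1))
    return set()  # no closing delimiter: treat as having no front matter


def missing_tags(readme_text):
    """List the required tags that do not appear as top-level keys."""
    present = front_matter_keys(readme_text)
    return [tag for tag in REQUIRED_TAGS if tag not in present]


# Hypothetical README: the tag values are placeholders, not the real card's.
example = """---
pretty_name: Reddit Webis-TLDR-17
task_categories:
- summarization
---
# Dataset Card
"""

print(missing_tags(example))
```

Running this on the hypothetical example prints every required tag except `task_categories`, since that is the only one of the eight present in the placeholder front matter.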

@lhoestq lhoestq merged commit 278db9f into huggingface:master Feb 28, 2022
@anna-kay
Contributor Author

Hello! I added the tags and created a PR. One note regarding the paperswithcode_id tag, which currently has the value "reddit": the dataset listed as reddit on paperswithcode is https://paperswithcode.com/dataset/reddit, and it is not Webis-TLDR-17. I could not find Webis-TLDR-17 on paperswithcode, either in the Summarization category or by searching for the keywords reddit, webis, and tldr, so I did not change this tag.
