add ontonotes_conll dataset #3853

Merged
merged 6 commits into from
Mar 15, 2022


richarddwang
Contributor

@richarddwang richarddwang commented Mar 8, 2022

Introduction of the dataset

OntoNotes v5.0 is the final version of the OntoNotes corpus, a large-scale, multi-genre,
multilingual corpus manually annotated with syntactic, semantic, and discourse information.

This dataset is the version of OntoNotes v5.0 that was extended and used in the CoNLL-2012 shared task. It includes v4 train/dev and v9 test data for English/Chinese/Arabic, plus the corrected v12 train/dev/test data (English only).

This dataset is widely used for named entity recognition, coreference resolution, and semantic role labeling.

In the dataset loading script, I adapted the code from AllenNLP/Ontonotes to read the special CoNLL files, so no extra package dependency is needed.
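To illustrate what reading these CoNLL files involves, here is a minimal sketch. It is not the AllenNLP code or the code in this PR. It assumes the usual CoNLL-2012 column layout (word in column 4, POS tag in column 5, named-entity bits in column 11, counting from 1), and the function names and the sample text are made up for illustration.

```python
def ne_bits_to_bio(bits):
    """Convert CoNLL-2012 named-entity bits such as "(PERSON*", "*)", "*" to BIO tags."""
    tags = []
    current = None  # label of the span that is currently open, if any
    for bit in bits:
        if "(" in bit:
            current = bit.strip("()*")
            tags.append(f"B-{current}")
        elif current is not None:
            tags.append(f"I-{current}")
        else:
            tags.append("O")
        if ")" in bit:
            current = None  # the span closes on this token
    return tags


def read_conll_sentences(text):
    """Split CoNLL-2012 formatted text into sentences of words, POS tags, and NER tags."""
    sentences, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):  # "#begin document ..." / "#end document"
            continue
        if not line:  # a blank line ends the current sentence
            if rows:
                sentences.append(rows)
                rows = []
            continue
        rows.append(line.split())
    if rows:
        sentences.append(rows)
    return [
        {
            "words": [r[3] for r in sent],
            "pos_tags": [r[4] for r in sent],
            "named_entities": ne_bits_to_bio([r[10] for r in sent]),
        }
        for sent in sentences
    ]


# Hypothetical sample, written by hand for this sketch (not taken from the corpus).
SAMPLE = """#begin document (bc/cctv/00/cctv_0000); part 000
bc/cctv/00/cctv_0000 0 0 John NNP (TOP(S(NP*) - - - Speaker#1 (PERSON) (ARG0*) (0)
bc/cctv/00/cctv_0000 0 1 sleeps VBZ (VP*) sleep 01 1 Speaker#1 * (V*) -
bc/cctv/00/cctv_0000 0 2 . . *)) - - - Speaker#1 * * -

#end document
"""

parsed = read_conll_sentences(SAMPLE)
print(parsed[0]["words"], parsed[0]["named_entities"])
```

The real files also carry parse bits, predicate arguments, word senses, and coreference chains, which this sketch ignores.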

Some workarounds I did

  1. task ids
    I added the tasks semantic-role-labeling, lemmatization, and word-sense-disambiguation, which I couldn't find anywhere in the existing task ids, to the task category structure-prediction, because they are related to "syntax". I feel there should be a better name for this task category, since some of these tasks aren't related to structure, but I don't have a good idea.

  2. dl_manager.extract
    Since unzipping the downloaded zip yields another zip, I have to call dl_manager.extract directly inside _generate_examples. But when testing with dummy data, dl_manager.extract does nothing, so I added a conditional that extracts the data manually when testing with dummy data.
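The nested-zip situation in the second workaround can be sketched with the standard library alone. This is only an illustration of manual extraction, assuming one level of nesting; it uses zipfile instead of dl_manager.extract, and extract_nested_zips is a hypothetical helper, not a function from this PR or from the datasets library.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(outer_zip, dest_dir):
    """Extract outer_zip into dest_dir, then extract every .zip found inside it.

    Returns the directories created for the inner archives.
    """
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(outer_zip) as outer:
        outer.extractall(dest_dir)

    inner_dirs = []
    # sorted() materializes the listing before new directories are created.
    for inner_zip in sorted(dest_dir.rglob("*.zip")):
        target = inner_zip.with_suffix("")  # data/inner.zip -> data/inner/
        with zipfile.ZipFile(inner_zip) as inner:
            inner.extractall(target)
        inner_dirs.append(target)
    return inner_dirs
```

In a loading script, a helper like this would only be needed on the code path where the download manager leaves the inner archive unextracted, which is the dummy-data case described above.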

Help

I don't know how to fix the doc-building error.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Member

@lhoestq lhoestq left a comment


Awesome ! Thanks for adding the dataset :)

The dataset card and dataset script are already really good, thanks ! I left a few comments:

datasets/ontonotesv5_conll2012/README.md (outdated, resolved)
datasets/ontonotesv5_conll2012/ontonotesv5_conll2012.py (outdated, resolved)
Member

@lhoestq lhoestq left a comment


Thank you ! Just did some minor changes

datasets/conll2012_ontonotesv5/conll2012_ontonotesv5.py (outdated, resolved)
Member

lhoestq commented Mar 15, 2022

The CI failure is unrelated to this dataset, merging :)

@lhoestq lhoestq merged commit 8f205aa into huggingface:master Mar 15, 2022