add ontonotes_conll dataset #3853

richarddwang · 2022-03-08T08:53:42Z

Introduction of the dataset

OntoNotes v5.0 is the final version of the OntoNotes corpus. It is a large-scale, multi-genre,
multilingual corpus manually annotated with syntactic, semantic, and discourse information.

This dataset is the version of OntoNotes v5.0 that was extended and used in the CoNLL-2012 shared task.
It includes v4 train/dev and v9 test data for English/Chinese/Arabic, and the corrected v12 train/dev/test data (English only).

This dataset is widely used for named entity recognition, coreference resolution, and semantic role labeling.

In the dataset loading script, I adapted the code from AllenNLP/Ontonotes to read the special CoNLL files without adding an extra package dependency.
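For readers unfamiliar with the file format, here is a minimal sketch of dependency-free reading. It is not the code in this PR or in AllenNLP. It assumes the usual CoNLL-2012 layout: whitespace-separated columns with the word in column 3 and the POS tag in column 4 (0-indexed), blank lines between sentences, and `#begin document` / `#end document` marker lines. The function names and the sample rows are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample in the assumed CoNLL-2012 column layout.
SAMPLE = """\
#begin document (demo/doc_0000); part 000
demo/doc_0000 0 0 Hello UH (TOP(INTJ*) - - - spk1 * -
demo/doc_0000 0 1 world NN *)) - - - spk1 * -

demo/doc_0000 0 0 Bye UH (TOP(INTJ*)) - - - spk1 * -

#end document
"""


def iter_conll_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of token rows (one list of columns per token)."""
    rows: List[List[str]] = []
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            # Document markers carry no tokens (assumption: token rows
            # always start with a document id, never with "#").
            continue
        if not line:
            # A blank line closes the current sentence, if any.
            if rows:
                yield rows
                rows = []
            continue
        rows.append(line.split())
    if rows:
        # Flush a final sentence that has no trailing blank line.
        yield rows


def words_and_pos(sentence: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Pick out the word and POS columns under the assumed layout."""
    return [row[3] for row in sentence], [row[4] for row in sentence]


sentences = list(iter_conll_sentences(SAMPLE.splitlines()))
```

The remaining columns (parse bits, predicate arguments, coreference) need span-aware decoding, which this sketch leaves out.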

Some workarounds I did

task ids
I added three tasks that I couldn't find anywhere among the existing task ids (semantic-role-labeling, lemmatization, and word-sense-disambiguation) to the task category structure-prediction, because they are related to "syntax". There is probably a better name for this task category, since some of these tasks aren't related to structure, but I couldn't think of one.
dl_manager.extract
Since unzipping the downloaded zip yields another zip, I have to call dl_manager.extract directly inside _generate_examples. But when testing with dummy data, dl_manager.extract does nothing, so I added a conditional that extracts the data manually in that case.
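To illustrate the zip-inside-a-zip situation, here is a minimal sketch of a manual fallback that uses only the standard library. It is not the code in this PR and does not call `dl_manager`. The function name `extract_nested_zip` and the `outer`/`inner` directory names are hypothetical, and it assumes the outer archive contains at most one inner `.zip` that matters.

```python
import zipfile
from pathlib import Path


def extract_nested_zip(outer_zip: Path, workdir: Path) -> Path:
    """Extract outer_zip, then extract the first .zip found inside it.

    Returns the directory holding the innermost extracted files.
    """
    outer_dir = workdir / "outer"
    with zipfile.ZipFile(outer_zip) as zf:
        zf.extractall(outer_dir)

    # Look for an archive nested anywhere inside the extracted tree.
    inner_zips = sorted(outer_dir.rglob("*.zip"))
    if not inner_zips:
        # Nothing nested: the outer contents are already the data.
        return outer_dir

    inner_dir = workdir / "inner"
    with zipfile.ZipFile(inner_zips[0]) as zf:
        zf.extractall(inner_dir)
    return inner_dir
```

A loading script could call a helper like this only on the dummy-data path and keep the regular extraction call everywhere else.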

Help

I don't know how to fix the doc build error.

HuggingFaceDocBuilderDev · 2022-03-08T08:59:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq

Awesome ! Thanks for adding the dataset :)

The dataset card and dataset script are already really good, thanks ! I left a few comments:

datasets/ontonotesv5_conll2012/README.md

datasets/ontonotesv5_conll2012/ontonotesv5_conll2012.py


Co-authored-by: Quentin Lhoest <[email protected]>

lhoestq

Thank you ! Just did some minor changes

datasets/conll2012_ontonotesv5/conll2012_ontonotesv5.py

lhoestq · 2022-03-15T10:47:55Z

The CI failure is unrelated to this dataset, merging :)

lhoestq reviewed Mar 10, 2022


richarddwang and others added 4 commits March 12, 2022 19:19

add ontonotesv5_conll2012 dataset

341b37a

Apply suggestions from code review

d541118

Co-authored-by: Quentin Lhoest <[email protected]>

rename, fix doc, fix dummy_data

162f7c0

fix flake8

4e1db0c

lhoestq approved these changes Mar 14, 2022


datasets/conll2012_ontonotesv5/conll2012_ontonotesv5.py (outdated, resolved)

datasets/conll2012_ontonotesv5/conll2012_ontonotesv5.py (outdated, resolved)

lhoestq added 2 commits March 14, 2022 18:20

Apply suggestions from code review

5702b24

typo

5007055

lhoestq merged commit 8f205aa into huggingface:master Mar 15, 2022
