add ontonotes_conll dataset #3853
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Awesome! Thanks for adding the dataset :)
The dataset card and dataset script are already really good, thanks! I left a few comments:
Co-authored-by: Quentin Lhoest <[email protected]>
Thank you! Just made some minor changes.
The CI failure is unrelated to this dataset, merging :)
Introduction of the dataset
OntoNotes v5.0 is the final version of the OntoNotes corpus, a large-scale, multi-genre,
multilingual corpus manually annotated with syntactic, semantic, and discourse information.
This dataset is the version of OntoNotes v5.0 extended and used in the CoNLL-2012 shared task.
It includes v4 train/dev and v9 test data for English/Chinese/Arabic, plus the corrected v12 train/dev/test data (English only).
This dataset is widely used for named entity recognition, coreference resolution, and semantic role labeling.
In the dataset loading script, I adapted the code of AllenNLP/Ontonotes to read the special CoNLL files without adding an extra package dependency.
Some workarounds I did
task ids
I added tasks that I couldn't find anywhere (semantic-role-labeling, lemmatization, and word-sense-disambiguation) to the task category structure-prediction, because they are related to "syntax". I feel there should be a better name for this task category, since some of the tasks mentioned aren't related to structure, but I don't have a good idea for one.
dl_manager.extract
Since unzipping the downloaded zip data yields another zip, I have to call dl_manager.extract directly inside _generate_examples. But when testing with dummy data, dl_manager.extract does nothing, so I added a conditional that manually extracts the data when testing with dummy data.
Help
I don't know how to fix the doc build error.
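For readers following the dl_manager.extract workaround above, here is a minimal sketch of that kind of fallback. It is not the code in this PR: the helper name ensure_extracted is hypothetical, the extract argument is a plain callable standing in for dl_manager.extract, and the "did the extractor do nothing?" check (the result is not a directory) is an assumption about how one might detect the dummy-data case. The manual path uses only the standard library zipfile module.

```python
import os
import tempfile
import zipfile


def ensure_extracted(archive_path, extract):
    """Return a directory holding the contents of archive_path.

    `extract` is any callable standing in for dl_manager.extract
    (hypothetical stand-in, not the real API). If it hands back
    something that is not a directory, for example the archive path
    unchanged, fall back to unpacking the zip manually.
    """
    result = extract(archive_path)
    if os.path.isdir(result):
        # The extractor did its job; use its output directory as-is.
        return result
    # Fallback: unpack manually into a fresh temporary directory.
    out_dir = tempfile.mkdtemp(prefix="nested_zip_")
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(out_dir)
    return out_dir
```

With a no-op extractor such as `lambda p: p`, the helper takes the manual path; with an extractor that returns a directory, it simply passes that directory through.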