added crosslingual coreference to spacy universe without additional commits #10580
Conversation
Hi David, thanks for this really interesting contribution! I was very impressed with the accuracies achieved across a wide range of languages, many of which lack coreference training sets large enough to train models for them individually.

Languages like Spanish often express anaphora using a third-person verb without a pronoun (e.g. "María llegó tarde; dijo que había tráfico", where the subject of "dijo" is implicit). In the examples I tried, such anaphora wasn't picked up, presumably because it doesn't occur in the 'basis' language, English. It might make sense to train a model with one such pro-drop language and to use this second model as the basis for other pro-drop languages.

If your aim is for the library to be used by a broad range of people, as opposed to just demonstrating the idea, and if you have time, it might be worth testing some of the edge cases (a document too long to be processed without exhausting memory, a document with no coreference clusters), as these led to errors when I tried them out.

Best wishes and thank you again,
@richardpaulhudson It is mind-bending how well it works, right?! I was reading up on some research papers that hinted towards cross-lingual effectiveness and thought, let's give it a try. The idea is to share cool and usable stuff with the world, so yes, let it be used by as many people as possible. Having said that: maybe it's better to hold off on the merge until these issues have been resolved. I opened an issue on our git.
@richardpaulhudson I added chunking functionality to overcome the OOM errors, and I also resolved the empty-document issue. I looked around for decent coref datasets for Spanish and Italian, but I couldn't really find any. So I propose wrapping up this merge and looking into the pro-drop languages together in the medium/long term.
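For readers who want to see what the chunking looks like from the spaCy side, here is a minimal sketch. The `xx_coref` component name, the `chunk_size`/`chunk_overlap` config keys, and the `doc._.coref_clusters`/`doc._.resolved_text` extensions follow the crosslingual-coreference README at the time of writing; treat them as assumptions if the API has since changed.

```python
import spacy
import crosslingual_coreference  # noqa: F401 -- importing registers the "xx_coref" factory

# Placeholder long input to exercise the chunking path.
long_text = (
    "Momofuku Ando was born in Taiwan. He later founded Nissin and "
    "invented instant noodles. " * 200
)

nlp = spacy.load("en_core_web_sm")

# Long inputs are processed in overlapping chunks so the transformer never
# has to hold the whole document in memory; clusters are merged across chunks.
nlp.add_pipe(
    "xx_coref",
    config={"chunk_size": 2500, "chunk_overlap": 2, "device": -1},  # device=-1: CPU
)

doc = nlp(long_text)
print(doc._.coref_clusters)  # mention clusters as token-index spans
print(doc._.resolved_text)   # input text with mentions replaced by cluster heads
```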
The UD_Spanish-AnCora and UD_Catalan-AnCora treebanks are pretty large, but they are in .conllu format. With some effort and reasoning, these could certainly be converted to AllenNLP-readable .conll. However, as with the Dutch datasets I tried (Corea, SoNaR-1), the coref annotation guidelines don't seem to align with the OntoNotes 5 guidelines. To get a decent benchmark, I was toying with the idea of translating OntoNotes 5 to Spanish and having a bilingual (Spanish/English) expert annotate the same coref clusters for the translated text. It's not ideal, but IMO it beats doing manual annotation for an arbitrary dataset or fixing something that is broken in the first place.
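For anyone exploring the AnCora route, a minimal sketch of streaming the UD .conllu files with the third-party `conllu` package (assumed installed via pip; the file path is a placeholder). The actual conversion, mapping AnCora's coreference scheme onto OntoNotes-style .conll columns, is deliberately left as a stub, since the guideline mismatch described above is the real obstacle:

```python
from conllu import parse_incr  # pip install conllu

# Placeholder path for the UD_Spanish-AnCora training split.
with open("UD_Spanish-AnCora/es_ancora-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        tokens = [token["form"] for token in sentence]
        # TODO: emit OntoNotes-style .conll rows here. The blocker is not the
        # file format but the annotation-guideline mismatch: AnCora's
        # coreference layer would have to be remapped to OntoNotes 5's scheme.
        print(" ".join(tokens))
        break  # first sentence only, as a smoke test
```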
I'll go ahead and merge this so it can be added to the website. If you'd like to continue the discussion about more general coref issues, a new discussion thread would be a good place to start. |
added crosslingual coreference to spacy universe without additional commits (#10580)

* added crosslingual coreference to spacy universe
* Updated example to introduce batching example.

Co-authored-by: David Berenstein <[email protected]>
#10572 #10563 @svlandeg
At Pandora Intelligence, I created another open-source package called crosslingual-coreference. It builds upon research hinting towards cross-lingual training for coreference, XLM models, and AllenNLP.

Description
I didn't test all languages covered by the spaCy models, but it offers decent coref for at least Dutch, Danish, and French, which aren't supported by Coreferee as of yet. On top of that, it also performs well on English and German.
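As an illustration of the multi-language claim, a short sketch of running the component on a Dutch pipeline; the assumption, per the package README, is that the component can be added to any spaCy pipeline, with the Dutch sentence here purely illustrative:

```python
import spacy
import crosslingual_coreference  # noqa: F401 -- registers "xx_coref"

# The coref weights themselves are cross-lingual (an XLM-style encoder with
# an English 'basis'), so they can ride on top of a Dutch spaCy pipeline.
nlp = spacy.load("nl_core_news_sm")
nlp.add_pipe("xx_coref")

doc = nlp("Anna vertelde Bart dat ze hem morgen in Amsterdam zou ontmoeten.")
print(doc._.coref_clusters)
```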
Types of change
documentation
Checklist