
added crosslingual coreference to spacy universe without additional commits #10580

Merged: 3 commits, Apr 8, 2022

Conversation

davidberenstein1957
Contributor

#10572 #10563 @svlandeg
At Pandora Intelligence, I created another open-source package called crosslingual-coreference. It builds on research hinting at the effectiveness of cross-lingual training for coreference, XLM models, and AllenNLP.

Description

I didn't test all languages covered by the spaCy models, but it offers decent coreference resolution for at least Dutch, Danish, and French, which aren't supported by Coreferee as of yet. On top of that, it also performs well on English and German.

Types of change

documentation

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added the docs Documentation and website label Mar 29, 2022
@richardpaulhudson
Contributor

Hi David, thanks for this really interesting contribution! I was very impressed with the accuracies achieved across a wide range of languages, many of which lack coreference training sets sufficient to train them individually.

Languages like Spanish often express anaphora using a verb in the third person without a pronoun; in the examples I tried, such anaphora wasn't picked up, presumably because it doesn't occur in the 'basis' language English. It might make sense to train a model with one such pro-drop language and to use this second model as the basis for other pro-drop languages.

If your aim is for the library to be used by a broad range of people, as opposed to just demonstrating the idea, and if you have time, it might be worth testing some of the edge cases (a document too long to be processed without exhausting memory; a document with no coreference clusters), as these led to errors when I tried them out.

Best wishes and thank you again,
Richard

@davidberenstein1957
Contributor Author

davidberenstein1957 commented Mar 30, 2022

@richardpaulhudson It is mind-bending how well it works, right?! I was reading up on some research papers that hinted towards cross-lingual effectiveness and thought, let's give it a try.

The idea is to share cool and usable stuff with the world, so yes, let it be used by as many people as possible. Having said that, maybe it's better to wait with the merge until these issues have been resolved. I opened an issue on our repo.

  • We already have a batching function we use internally, which we could add to the package as well. Additionally, we could sacrifice some accuracy in favor of speed by using a distilled multilingual model.
  • The empty document issue can be resolved easily too.
  • Training for pro-drop languages (Spanish/Italian) seems to be an issue, since we would need CoNLL-formatted and properly annotated data. Maybe you have some data lying around?
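
The batching/chunking idea in the first bullet can be sketched in plain Python. This is only an illustration of overlapping chunking, not the package's actual implementation; `chunk_tokens` and its defaults are hypothetical:

```python
def chunk_tokens(tokens, chunk_size=2300, overlap=50):
    """Split a token list into chunks of at most `chunk_size` tokens,
    with `overlap` tokens shared between consecutive chunks so that
    mentions near a chunk boundary are visible to both chunks."""
    if not tokens:
        return []
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap")
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]


tokens = [f"tok{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
# each pair of consecutive chunks shares `overlap` tokens,
# and together the chunks cover the whole document
```

The overlap matters because a mention split across a hard chunk boundary would otherwise be invisible to the model in both halves.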

@davidberenstein1957
Contributor Author

davidberenstein1957 commented Apr 3, 2022

@richardpaulhudson I added chunking functionality to overcome OOM errors, and I also resolved the empty-document issues. I looked around for decent coref datasets for Spanish/Italian, but I couldn't really find any. So, I propose wrapping up this merge and looking into the pro-drop languages together in the medium/long term.
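
For readers curious what a chunking fix like this involves: after predicting per chunk, chunk-local mention offsets have to be shifted back into document coordinates, and clusters seen in more than one overlapping chunk deduplicated. A simplified sketch under those assumptions (helper name hypothetical, not the package's API); it also handles the empty-document case trivially, since no chunks means no clusters:

```python
def merge_chunk_clusters(chunk_clusters, chunk_starts):
    """Shift each chunk's local (start, end) mention spans by that
    chunk's document offset, dropping clusters already produced by
    an earlier (overlapping) chunk."""
    merged, seen = [], set()
    for clusters, offset in zip(chunk_clusters, chunk_starts):
        for cluster in clusters:
            doc_cluster = [(s + offset, e + offset) for s, e in cluster]
            key = frozenset(doc_cluster)
            if key not in seen:
                seen.add(key)
                merged.append(doc_cluster)
    return merged


# two chunks starting at document offsets 0 and 2;
# each inner list is one cluster of (start, end) token spans
merged = merge_chunk_clusters(
    [[[(0, 1), (3, 3)]], [[(1, 1)]]],
    [0, 2],
)
```

A real implementation would also have to reconcile clusters that only partially overlap between chunks; this sketch only deduplicates exact matches.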

@davidberenstein1957
Contributor Author

davidberenstein1957 commented Apr 5, 2022

The UD_Spanish-AnCora and UD_Catalan-AnCora treebanks are pretty large, but they are in .conllu format. With some effort and reasoning, these could certainly be converted to AllenNLP-readable .conll. However, as with the Dutch datasets I tried (Corea, Sonar-1), the coref annotation guidelines don't seem to align with the OntoNotes 5 guidelines.
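
Such a conversion would start with a reader for the 10-column CoNLL-U format; a minimal sketch is below. Note this only handles the tokenization layer; filling the coreference column of the AllenNLP-style .conll output from the corpus's own annotation scheme is the hard part the comment above refers to:

```python
def conllu_sentences(lines):
    """Yield sentences from CoNLL-U input as lists of 10-column token
    rows, skipping comment lines, multiword-token ranges (ids like
    '1-2'), and empty nodes (ids like '1.1')."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if "-" not in cols[0] and "." not in cols[0]:
                sent.append(cols)
    if sent:
        yield sent


sample = (
    "# sent_id = 1\n"
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tmundo\tmundo\tNOUN\t_\t_\t1\tdep\t_\t_\n"
)
sentences = list(conllu_sentences(sample.splitlines()))
```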

To have a decent benchmark, I was toying with the idea of translating OntoNotes 5 to Spanish and letting a bilingual (Spanish/English) expert annotate the same coref clusters for the translated text. It's not ideal, but IMO it beats doing manual annotation for an arbitrary dataset, or trying to fix data that is broken in the first place.

@adrianeboyd
Contributor

I'll go ahead and merge this so it can be added to the website. If you'd like to continue the discussion about more general coref issues, a new discussion thread would be a good place to start.

@adrianeboyd adrianeboyd merged commit d4196a6 into explosion:master Apr 8, 2022
adrianeboyd pushed a commit that referenced this pull request Apr 8, 2022
added crosslingual coreference to spacy universe without additional commits (#10580)

* added crosslingual coreference to spacy universe

* Updated example to introduce batching example.

Co-authored-by: David Berenstein <[email protected]>
@polm polm added the universe Changes to the Universe directory of third-party spaCy code. label Nov 18, 2022