
added crosslingual coreference to spacy universe without additional commits #10580

Merged: 3 commits, Apr 8, 2022

Conversation

davidberenstein1957
Contributor

#10572 #10563 @svlandeg
At Pandora Intelligence, I created another open-source package called crosslingual-coreference. It builds on research hinting at the effectiveness of cross-lingual training for coreference, XLM models, and AllenNLP.

Description

I didn't test all languages covered by the spaCy models, but it offers decent coreference resolution for at least Dutch, Danish, and French, which aren't supported by Coreferee as of yet. On top of that, it also performs well on English and German.

Types of change

documentation

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added the docs Documentation and website label Mar 29, 2022
@richardpaulhudson
Contributor

Hi David, thanks for this really interesting contribution! I was very impressed with the accuracies achieved across a wide range of languages, many of which lack coreference training sets sufficient to train them individually.

Languages like Spanish often express anaphora using a verb in the third person without a pronoun; in the examples I tried, such anaphora wasn't picked up, presumably because it doesn't occur in the 'basis' language English. It might make sense to train a model with one such pro-drop language and to use this second model as the basis for other pro-drop languages.

If your aim is for the library to be used by a broad range of people, as opposed to just demonstrating the idea, and if you have time, it might be worth testing some of the edge cases (a document too long to be processed without exhausting memory; a document with no coreference clusters), as these led to errors when I tried them out.

Best wishes and thank you again,
Richard

@davidberenstein1957
Contributor Author

davidberenstein1957 commented Mar 30, 2022

@richardpaulhudson It is mind-bending how well it works, right?! I was reading up on some research papers that hinted towards cross-lingual effectiveness and thought, let's give it a try.

The idea is to share cool and usable stuff with the world, so yes, let it be used by as many people as possible. Having said that, maybe it's better to wait with the merge until these issues have been resolved. I opened an issue on our repo.

  • We already have a batching function we use internally, which we could add to the package as well. Additionally, we could sacrifice some accuracy in favor of speed by using a distilled multilingual model.
  • The empty document issue can be resolved easily too.
  • Training for pro-drop languages (Spanish/Italian) seems to be an issue, since we would need CoNLL-formatted and properly annotated data. Maybe you have some data lying around?
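
The batching/chunking idea in the first bullet can be sketched in plain Python. This is only an illustration of overlapping chunking, not the package's actual implementation; `chunk_tokens` and its defaults are hypothetical:

```python
def chunk_tokens(tokens, chunk_size=2300, overlap=50):
    """Split a token list into chunks of at most `chunk_size` tokens,
    with `overlap` tokens shared between consecutive chunks so that
    mentions near a chunk boundary are visible to both chunks."""
    if not tokens:
        return []
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap")
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]


tokens = [f"tok{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
# each pair of consecutive chunks shares `overlap` tokens,
# and together the chunks cover the whole document
```

The overlap matters because a mention split across a hard chunk boundary would otherwise be invisible to the model in both halves.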

@davidberenstein1957
Contributor Author

davidberenstein1957 commented Apr 3, 2022

@richardpaulhudson I added chunking functionality to overcome OOM errors, and I also resolved the empty-document issues. I looked around for decent coref datasets for Spanish/Italian, but I couldn't really find any. So, I propose wrapping up this merge and looking into the pro-drop languages together in the medium/long term.
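
For readers curious what a chunking fix like this involves: after predicting per chunk, chunk-local mention offsets have to be shifted back into document coordinates, and clusters seen in more than one overlapping chunk deduplicated. A simplified sketch under those assumptions (helper name hypothetical, not the package's API); it also handles the empty-document case trivially, since no chunks means no clusters:

```python
def merge_chunk_clusters(chunk_clusters, chunk_starts):
    """Shift each chunk's local (start, end) mention spans by that
    chunk's document offset, dropping clusters already produced by
    an earlier (overlapping) chunk."""
    merged, seen = [], set()
    for clusters, offset in zip(chunk_clusters, chunk_starts):
        for cluster in clusters:
            doc_cluster = [(s + offset, e + offset) for s, e in cluster]
            key = frozenset(doc_cluster)
            if key not in seen:
                seen.add(key)
                merged.append(doc_cluster)
    return merged


# two chunks starting at document offsets 0 and 2;
# each inner list is one cluster of (start, end) token spans
merged = merge_chunk_clusters(
    [[[(0, 1), (3, 3)]], [[(1, 1)]]],
    [0, 2],
)
```

A real implementation would also have to reconcile clusters that only partially overlap between chunks; this sketch only deduplicates exact matches.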

@davidberenstein1957
Contributor Author

davidberenstein1957 commented Apr 5, 2022

The UD_Spanish-AnCora and UD_Catalan-AnCora treebanks are pretty large, but they are in .conllu format. With some effort and reasoning, these could certainly be converted to AllenNLP-readable .conll. However, as with the Dutch datasets I tried (Corea, Sonar-1), the coref annotation guidelines don't seem to align with the OntoNotes 5 guidelines.
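
Such a conversion would start with a reader for the 10-column CoNLL-U format; a minimal sketch is below. Note this only handles the tokenization layer; filling the coreference column of the AllenNLP-style .conll output from the corpus's own annotation scheme is the hard part the comment above refers to:

```python
def conllu_sentences(lines):
    """Yield sentences from CoNLL-U input as lists of 10-column token
    rows, skipping comment lines, multiword-token ranges (ids like
    '1-2'), and empty nodes (ids like '1.1')."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if "-" not in cols[0] and "." not in cols[0]:
                sent.append(cols)
    if sent:
        yield sent


sample = (
    "# sent_id = 1\n"
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tmundo\tmundo\tNOUN\t_\t_\t1\tdep\t_\t_\n"
)
sentences = list(conllu_sentences(sample.splitlines()))
```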

To have a decent benchmark, I was toying with the idea of translating OntoNotes 5 to Spanish and letting a bilingual (Spanish/English) expert annotate the same coref clusters for the translated text. It's not ideal, but IMO it beats doing manual annotation for an arbitrary dataset, or trying to fix data that is broken in the first place.

@adrianeboyd
Contributor

I'll go ahead and merge this so it can be added to the website. If you'd like to continue the discussion about more general coref issues, a new discussion thread would be a good place to start.

@adrianeboyd adrianeboyd merged commit d4196a6 into explosion:master Apr 8, 2022
adrianeboyd pushed a commit that referenced this pull request Apr 8, 2022
added crosslingual coreference to spacy universe without additional commits (#10580)

* added crosslingual coreference to spacy universe

* Updated example to introduce batching example.

Co-authored-by: David Berenstein <[email protected]>
@polm polm added the universe Changes to the Universe directory of third-party spaCy code. label Nov 18, 2022