Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add edit tree lemmatizer #10231

Merged

Conversation

adrianeboyd
Copy link
Contributor

@adrianeboyd adrianeboyd commented Feb 7, 2022

Description

Add edit tree lemmatizer, converted from spacy_experimental.edit_tree_lemmatizer

Types of change

Enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd adrianeboyd requested a review from danieldk February 7, 2022 13:51
@adrianeboyd
Copy link
Contributor Author

@danieldk: In particular, can you double-check all the types?

spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_trees.pxd Outdated Show resolved Hide resolved
spacy/tests/pipeline/test_edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
@adrianeboyd
Copy link
Contributor Author

I'm not sure that spacy.pipelines was the best place for edit_trees, but there wasn't an immediately obvious better place to put it.

I'm also not sure that edit_tree_lemmatizer is the name we should go with in the end, since it's kind of opaque/jargon-y for users (I can't think of the actual name for this ... where the name is from the developers' perspective rather than the users').

My original intention was to have this kind of component as a mode of lemmatizer, which has tried to follow the TrainablePipe API enough to make it possible in the future (mainly having a model parameter), but now I think it will be an overall API headache to mix trainable and not-trainable modes.

@danieldk danieldk force-pushed the feature/add-edit-tree-lemmatizer branch from 82c9e50 to 773e1f5 Compare February 7, 2022 16:53
@danieldk
Copy link
Contributor

danieldk commented Feb 7, 2022

I'm also not sure that edit_tree_lemmatizer is the name we should go with in the end, since it's kind of opaque/jargon-y for users (I can't think of the actual name for this ... where the name is from the developers' perspective rather than the users').

Maybe TrainableLemmatizer/trainable_lemmatizer?

@adrianeboyd
Copy link
Contributor Author

So what I mean is the default component name / factory name and not the class name, where I think that EditTreeLemmatizer is fine.

Nothing else is trainable_x and "trainable" is expressed in other ways. But maybe there aren't any better alternatives...

@svlandeg svlandeg added enhancement Feature requests and improvements feat / lemmatizer Feature: Rule-based and lookup lemmatization labels Feb 8, 2022
@svlandeg svlandeg added the v3.3 Related to v3.3 label Feb 16, 2022
@svlandeg svlandeg changed the base branch from develop to master February 16, 2022 14:47
Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really excited to have this as part of the main library! During review, I mainly had some small comments about typing.

spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
spacy/pipeline/edit_tree_lemmatizer.py Outdated Show resolved Hide resolved
@svlandeg
Copy link
Member

I'm not sure that spacy.pipelines was the best place for edit_trees, but there wasn't an immediately obvious better place to put it.

I was thinking that too. But having spacy.pipeline._edittree_internals like we have for the parser is probably overkill.

For the entity_linker, we did end up deciding having kb.pyx and kb.pxd in the root spacy module though.

@adrianeboyd
Copy link
Contributor Author

Maybe _internals and we move the parser stuff there at some point, too?

@svlandeg
Copy link
Member

Nothing else is trainable_x and "trainable" is expressed in other ways. But maybe there aren't any better alternatives...

trainable_x does make the distinction more clear to users, I think, but I don't like it for the same reason you mention. Alternatively: neural_lemmatizer. Though it has the same issues, at least it's shorter :|

@svlandeg
Copy link
Member

Maybe _internals and we move the parser stuff there at some point, too?

I don't think that'll work, because the internal parser classes are so genericly named that then it might become more difficult to see what all the parser code is...

danieldk and others added 7 commits February 18, 2022 08:25
Co-authored-by: Sofie Van Landeghem <[email protected]>
This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.
@danieldk
Copy link
Contributor

Tested German with the validation/serialization changes, there doesn't seem to be a regression.

Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is ready to merge?

We can revise the cast thing later, it'd require an update of Thinc and I don't want to hold up this PR over something like that.

@adrianeboyd
Copy link
Contributor Author

No, not yet. I would wait a bit until the potential integrations with Lemmatizer have been explored a bit more.

@svlandeg
Copy link
Member

Oh, ok. I thought we didn't want to go there initially, but I'll leave that up to you Adriane. I'll put this in draft to signal that we shouldn't merge yet!

@adrianeboyd
Copy link
Contributor Author

We decided that an integration with Lemmatizer or the lemmatizer factory is not a particular useful choice. It's already supported to do something like this:

nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")

@adrianeboyd
Copy link
Contributor Author

Now this needs docs...

@danieldk
Copy link
Contributor

Now this needs docs...

Added, hope I didn't miss anything.

@danieldk danieldk marked this pull request as ready for review March 21, 2022 13:19
@danieldk danieldk self-requested a review March 21, 2022 14:35
@danieldk danieldk merged commit 85778df into explosion:master Mar 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / lemmatizer Feature: Rule-based and lookup lemmatization v3.3 Related to v3.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants