Proposal: Cadmium::POSTagger #32

Open
rmarronnier opened this issue Sep 13, 2019 · 5 comments
Labels
enhancement (New feature or request), in progress

Comments

@rmarronnier
Member

Preface

As discussed in #31, Cadmium::Lemmatizer needs a Token object carrying POS and morphology data to work properly and be fully tested.
The aim of this proposal is to implement a Cadmium::POSTagger that creates such a Token object for each input string.
The first tagging algorithm I plan to implement is the Viterbi algorithm.
If I can generalize it enough, I'll move it to Cadmium::Classifier so it can be used for other purposes.
Later, I also plan to implement dynamic feature induction, which could also be used for Named Entity Recognition (and so could move to Classifier as well).
As with the Tokenizer and Summarizer modules, the plan is to let users choose a specific algorithm instead of being tied to a single one.
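
To make the idea concrete, here is a rough sketch of what such a Token and a pluggable tagger entry point could look like. It is written in Python for brevity rather than Crystal, and every name in it (`Token`, `tag`, `noun_baseline`, the field names) is hypothetical and not part of Cadmium's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Token:
    """Hypothetical token record shared by a POS tagger and a lemmatizer."""
    text: str
    pos: Optional[str] = None                      # filled in by the tagger
    morphology: Dict[str, str] = field(default_factory=dict)
    lemma: Optional[str] = None                    # filled in later by a lemmatizer


# A tagging algorithm is modelled as any callable mapping words to tags,
# so the caller can swap one algorithm for another.
Algorithm = Callable[[List[str]], List[str]]


def tag(words: List[str], algorithm: Algorithm) -> List[Token]:
    """Run the chosen algorithm and wrap each word/tag pair in a Token."""
    tags = algorithm(words)
    return [Token(text=w, pos=t) for w, t in zip(words, tags)]


def noun_baseline(words: List[str]) -> List[str]:
    """Trivial placeholder algorithm: tag every word as a noun."""
    return ["NOUN"] * len(words)


tokens = tag(["time", "flies"], noun_baseline)
```

Passing the algorithm as a parameter is one way to keep the tagger from being tied to a single implementation; a real design might use a class hierarchy or an enum instead.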

Details

I propose to implement this in the following steps:

  • Create a cadmiumcr/pos_tagger repository
  • Implement a POC POS tagger with the Viterbi algorithm
  • If the algorithm can be generalized (i.e. it is not too specific to POS tagging), move it to Cadmium::Classifier::Viterbi
  • Move the working POS tagger to its own repository along with the English tagging data
  • Push data for other languages to the cadmiumcr/languages repository (I'm not sure about this one yet, as I don't know the sizes of the models)
  • Move the Token struct to Cadmium::Utils, as it will be used by at least the POS tagger and the Lemmatizer
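
For readers unfamiliar with the algorithm, the POC step could start from something like the sketch below: Viterbi decoding over a bigram hidden Markov model, working in log space. It is written in Python for brevity, not Crystal, and the function signature, the `unk` smoothing constant, and the toy probabilities are all made up for illustration; a real tagger would estimate them from a tagged corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely tag sequence for `words` under a bigram HMM.

    `unk` is an assumed fallback emission probability for unseen words.
    """
    if not words:
        return []

    def lp(p):
        # Log-probability; zero probability maps to -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best path ending in tag t at word i.
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], unk))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], unk))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with invented numbers, for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}

result = viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P)
# result == ['DET', 'NOUN', 'VERB']
```

Nothing in the function is specific to POS tags: the states and observations are just dictionary keys, which is what would make a move to a generic classifier plausible.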

References

List of existing POS Taggers

@watzon
Member

watzon commented Sep 13, 2019

I don't see any issues with it. I love the idea of being able to generalize it enough to make a classifier. The question is, does that algo work for multiple different languages and character sets?

@rmarronnier
Member Author

Both algos work for multiple languages and multiple character sets, provided you give them a language-specific model (it's not the job of the POS tagger to identify the language of the text to tag).

For reference, you can look at this implementation (which is a student assignment :-) ) of the Viterbi algo. It's called HMM in that repo, but Viterbi is just one of the algorithms used with HMMs: it finds the most likely tag sequence ;-)

@watzon
Member

watzon commented Sep 13, 2019

Nice! Well I say go for it. You can go ahead and create the repo.

@rmarronnier
Member Author

Thanks!
I realize I haven't fully answered your question. When I say language-specific model, I mean a language-specific set of POS tags plus the trained model.
We can map the language-specific POS tags to universal POS tags, as spaCy (again! :-p) does.
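
As a rough illustration of that mapping, here is a small Python sketch (not Crystal) that converts a handful of English Penn Treebank tags to Universal POS tags. The table is deliberately partial, and both `PTB_TO_UPOS` and `to_universal` are hypothetical names; this is not spaCy's actual mapping table or API.

```python
# Partial, illustrative mapping from Penn Treebank tags to Universal POS tags.
PTB_TO_UPOS = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "PRP": "PRON",
    "CC": "CCONJ",
    "CD": "NUM",
    "UH": "INTJ",
}


def to_universal(tag, mapping=PTB_TO_UPOS):
    """Map a language-specific tag to a universal one, falling back to 'X'."""
    return mapping.get(tag, "X")


universal = [to_universal(t) for t in ["DT", "NNS", "VBD", "??"]]
```

Each language would ship its own table alongside its trained model, while downstream code such as the Lemmatizer only has to deal with the universal tags.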

@watzon
Member

watzon commented Sep 13, 2019

Awesome!

@watzon transferred this issue from another repository Nov 7, 2019
@watzon added the enhancement and in progress labels Nov 7, 2019

2 participants