Proposal: Cadmium::POSTagger #32
Comments
I don't see any issues with it. I love the idea of being able to generalize it enough to make a classifier. The question is, does that algorithm work for multiple languages and character sets?
Both algorithms work for multiple languages and character sets, provided you give them a language-specific model (it's not the job of the POS tagger to identify the language of the text it tags). For reference, you can see this implementation (which is a student assignment :-) ) of the Viterbi algorithm. It's called HMM in this repo, but Viterbi is just one of the HMM algorithms ;-)
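To make the point about language models concrete, here is a minimal sketch of Viterbi decoding over a hidden Markov model. It is written in Python purely for illustration and is not Cadmium's API: the function name `viterbi`, the parameters `start_p`, `trans_p`, `emit_p`, and `unknown_p`, and the toy probabilities are all assumptions made up for this example. The decoder only looks tokens up in the model's tables, so nothing in it depends on the language or character set.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, unknown_p=1e-6):
    """Return the most probable tag sequence for `tokens` under an HMM.

    All parameters are hypothetical names for this sketch:
    start_p[tag]        -> P(tag at sentence start)
    trans_p[prev][tag]  -> P(tag | previous tag)
    emit_p[tag][word]   -> P(word | tag)
    unknown_p           -> fallback emission probability for unseen words
    """
    if not tokens:
        return []

    def log(p):
        # Work in log space to avoid underflow on long sentences.
        return math.log(p) if p > 0 else float("-inf")

    def emit(tag, word):
        return emit_p[tag].get(word, unknown_p)

    # best[i][tag]: log-probability of the best path ending in `tag` at position i.
    best = [{t: log(start_p.get(t, 0.0)) + log(emit(t, tokens[0])) for t in tags}]
    # back[i][tag]: previous tag on that best path.
    back = [{}]

    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(emit(t, tokens[i]))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy English model (made-up numbers, for illustration only).
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "cat": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.8, "runs": 0.2},
}

print(viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P))
# prints ['DET', 'NOUN', 'VERB']
```

Swapping in tables estimated from another language's corpus changes the output without touching the decoder, which is why language identification can stay outside the tagger.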
Nice! Well, I say go for it. You can go ahead and create the repo.
Thanks!
Awesome!
Preface
As discussed in #31, Cadmium::Lemmatizer needs a Token object with POS and morphology data to work properly and be fully tested.
The aim of this proposal is to implement a Cadmium::POSTagger that will create such a Token object for each input string.
The first tagging algorithm I'm planning to implement is the Viterbi algorithm. If I can generalize it enough, the plan is to move it to Cadmium::Classifier so it can be used for other objectives.
I'm also planning to implement Dynamic feature induction later, which could also be used for Named Entity Recognition (and so be moved to Classifier).
The plan, as with the Tokenizer module or the Summarizer, is to make it possible to choose a specific algorithm instead of being tied to a single one.
Details
I propose to implement this with these actions:
- cadmiumcr/pos_tagger repository
- cadmiumcr/languages repository (I'm not sure about this one yet, without knowing the sizes of the models)
- Token struct to Cadmium::Utils, as it will be used at least by both the POS Tagger and the Lemmatizer
References
List of existing POS Taggers