AboutHiveDummyTagger
Google Code Exporter edited this page Apr 30, 2015
The DummyTagger implementation uses a part-of-speech tagger and the HIVE Lucene index. The Postagger class appears to be based on an example from the LingPipe project; it uses the LingPipe libraries and the "medical" POS model (medtag).
The model can be trained from the LingPipe posTags tutorial directory:
- cd C:\mrc\nlp\lingpipe-4.0.1\demos\tutorial\posTags
- ant -Ddata.pos.medpost=c:\mrc\nlp\medtag\medpost train-medpost
- dir ..\..\models
01/03/2011 09:59 AM 4,974,338 pos-enbio-medpost.HiddenMarkovModel
- The TrainPostagger class is taken directly from the LingPipe posTags tutorial.
java -cp build\classes;..\..\..\lingpipe-4.0.0.jar TrainMedPost c:\mrc\nlp\medtag\medpost myModel
From DummyTagger.extractKeyphrases
- Write text to in-memory Lucene index (IndexWriter)
- Generate dictionary (POS/word) using Postagger
- Read document using Lucene IndexReader
- Get the term frequency vector
- Get terms and frequencies
  - Calculate probability?
  - Add term to “Documento” and “Vocabulario”
- Add Document to Colleccio
- Calculate vocabulary probabilities
- Calculate collection divergences?
- For each “document” in the Coleccio
  - For each term in document
    - If TF > 0.1 and term “isAllowed” (valid part of speech) and term length > 1
      - Add term to ranking?
- For each term in ranking
  - Add to keywords
- Return keywords
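The selection step near the end of the list can be sketched in plain Java. This is a minimal illustration, not the HIVE source: the class name `KeyphraseFilterSketch`, the method `selectKeywords`, the map-based inputs, and the example tags are all hypothetical, and it assumes that TF is a frequency normalized to the range 0 to 1 and that "isAllowed" is a lookup against a set of accepted POS tags. No scoring is applied, since the notes do not say how the ranking is ordered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the selection loop; not the HIVE source. */
final class KeyphraseFilterSketch {

    // Threshold taken from the notes ("TF > 0.1").
    static final double MIN_TF = 0.1;

    /**
     * @param termFrequencies term -> frequency, assumed normalized to [0, 1]
     * @param posByTerm       term -> POS tag, standing in for the Postagger dictionary
     * @param allowedTags     tags accepted by the assumed "isAllowed" check
     * @return terms that pass all three conditions, in input order
     */
    static List<String> selectKeywords(Map<String, Double> termFrequencies,
                                       Map<String, String> posByTerm,
                                       Set<String> allowedTags) {
        List<String> ranking = new ArrayList<>();
        for (Map.Entry<String, Double> entry : termFrequencies.entrySet()) {
            String term = entry.getKey();
            String tag = posByTerm.get(term);
            boolean allowed = tag != null && allowedTags.contains(tag);
            if (entry.getValue() > MIN_TF && allowed && term.length() > 1) {
                ranking.add(term);
            }
        }
        return ranking;
    }

    public static void main(String[] args) {
        // Illustrative data only; the tags are examples, not the tagger's output.
        Map<String, Double> tf = new LinkedHashMap<>();
        tf.put("protein", 0.4);   // kept
        tf.put("binds", 0.3);     // dropped: tag not allowed
        tf.put("x", 0.2);         // dropped: length 1
        tf.put("receptor", 0.05); // dropped: TF too low
        Map<String, String> pos = Map.of(
                "protein", "NN", "binds", "VVZ", "x", "NN", "receptor", "NN");
        System.out.println(selectKeywords(tf, pos, Set.of("NN", "NNS")));
    }
}
```

Keeping the three conditions in one predicate makes it easy to see that a term must pass all of them before it reaches the ranking.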
Performance observations
- The Postagger loads the HMM on every instantiation.
- Tokenization is slow.
- HMM.firstBest is very slow.
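The first item in the list above suggests loading the model once and sharing it between tagger instances. The following is a minimal sketch of that idea, assuming the model object is safe to share once loaded. `CachedModel` is a hypothetical helper, and the loader passed to it is a placeholder for whatever code actually reads the HMM file; nothing here is taken from the HIVE or LingPipe sources.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: runs an expensive loader at most once and caches the result. */
final class CachedModel<T> {
    private final Supplier<T> loader;
    private volatile T model; // volatile is required for safe double-checked locking

    CachedModel(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the cached model, loading it on first use. The loader must not return null. */
    T get() {
        T local = model;
        if (local == null) {
            synchronized (this) {
                local = model;
                if (local == null) {
                    local = loader.get();
                    model = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // The lambda stands in for the real (slow) model deserialization.
        CachedModel<String> hmm = new CachedModel<>(() -> {
            loads.incrementAndGet();
            return "model placeholder";
        });
        hmm.get();
        hmm.get();
        System.out.println("loads = " + loads.get());
    }
}
```

A tagger class could hold one such cache in a static field, so that constructing a new tagger reuses the already loaded model instead of reading it from disk again.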