unknown words and vocabulary cutoff #3

anoopsarkar · 2014-04-02T17:21:35Z

We should try computing accuracy on the dev set with different count cutoffs for the vocabulary only (i.e. not a count cutoff on features, but an initial pruning of the vocabulary by unigram frequency). This should also give more compact models and faster training.
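
A minimal sketch of what the vocabulary pruning could look like, assuming the training data is a list of tokenized sentences. The names `build_vocab`, `apply_vocab`, and the `<UNK>` token are hypothetical, not taken from the parser's code; the dev-set evaluation itself is left out.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder for pruned words


def build_vocab(sentences, min_count):
    """Keep only words whose unigram count in the training data is >= min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary word with the UNK token."""
    return [word if word in vocab else UNK for word in sentence]


# Toy training data (made up for illustration only).
train = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]

# Sweep over a few cutoffs; in practice each pruned vocabulary would be
# used to retrain the model and measure dev-set accuracy.
for cutoff in (1, 2, 3):
    vocab = build_vocab(train, cutoff)
    print(cutoff, len(vocab), apply_vocab(["the", "cat", "barks"], vocab))
```

Because the pruning happens before feature extraction, every feature that mentions a rare word collapses onto the same `<UNK>` feature, which is where the smaller model would come from.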

A separate idea (which we can come back to later on) is to replace uncommon words with their word shape, so that "reprehensible" would get replaced by the token "xxxx", the word "Boeing" by "Xxxx", a hyphenated word "ex-accomplice" by "Xx-xx", a sequence of 4 digits by "YYYY", a longer number by "Dddd", and a number with a decimal point by "Dd.dd". Some of these heuristics are not that useful since the information is already captured by the POS tag, but they do help in debugging these unknown word features later on.
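
A rough sketch of a word-shape function along these lines. The exact mapping is an assumption and does not reproduce the shape strings quoted above character for character: here uppercase letters become "X", lowercase letters "x", digits "d", other characters are kept, runs of the same symbol are capped at `max_run`, and a 4-digit token is special-cased to "YYYY". The name `word_shape` is hypothetical.

```python
import re


def word_shape(word, max_run=4):
    """Map a word to a coarse shape string (an assumed scheme, see lead-in)."""
    # Special case: exactly four digits is treated as a year-like token.
    if word.isdigit() and len(word) == 4:
        return "YYYY"
    symbols = []
    for ch in word:
        if ch.isdigit():
            symbols.append("d")
        elif ch.isalpha():
            symbols.append("X" if ch.isupper() else "x")
        else:
            symbols.append(ch)  # keep punctuation such as '-' and '.'
    shape = "".join(symbols)
    # Cap every run of a repeated symbol at max_run characters.
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, shape)


for w in ["reprehensible", "Boeing", "ex-accomplice", "1987", "123456", "12.75"]:
    print(w, word_shape(w))
```

Only words that fall below the vocabulary cutoff would be passed through such a function; frequent words would keep their surface form.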

anoopsarkar modified the milestone: speed up the parser Apr 2, 2014
