unknown words and vocabulary cutoff #3

Open
anoopsarkar opened this issue Apr 2, 2014 · 0 comments

Comments

@anoopsarkar
Member

We should try computing accuracy on the dev set with different count cutoffs applied to the vocabulary only (i.e. not a count cutoff on features, but an initial pruning of the vocabulary based on unigram frequency). A smaller vocabulary should also mean more compact models and faster training.
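A minimal sketch of the vocabulary pruning step, assuming sentences are lists of word strings. The names `build_vocab`, `apply_vocab` and the `<unk>` placeholder are hypothetical and not taken from the parser's code; the dev-set accuracy run for each cutoff is left out.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for pruned words


def build_vocab(sentences, min_count):
    """Keep only words whose unigram count is at least min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


def apply_vocab(sentences, vocab, unk=UNK):
    """Replace every out-of-vocabulary word with the unk token."""
    return [[w if w in vocab else unk for w in sent] for sent in sentences]


# Toy training data, for illustration only.
train = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]

# Vocabulary size for each candidate cutoff; in a real experiment each
# pruned corpus would be used to train a model and score the dev set.
sizes = {}
for cutoff in (1, 2, 3):
    vocab = build_vocab(train, cutoff)
    sizes[cutoff] = len(vocab)

pruned = apply_vocab(train, build_vocab(train, 2))
```

Because the pruning happens before feature extraction, the feature templates stay unchanged and only the input tokens differ between runs.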

A separate idea (which we can come back to later on) is to replace uncommon words with their word shape, so that "reprehensible" would be replaced by the token "xxxx", the word "Boeing" by "Xxxx", and a hyphenated word like "ex-accomplice" by "Xx-xx". A sequence of 4 digits would be replaced by "YYYY", a longer number would become "Dddd", and a number with a decimal point would become "Dd.dd". Some of these heuristics are not that useful, since the information is already captured by the POS tag, but they do help in debugging these unknown word features later on.
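A rough sketch of one possible word-shape mapping. The function name `word_shape`, the `max_len` truncation, and the rule that every all-digit token other than 4 digits maps to "Dddd" are assumptions made for illustration. The hyphen rule here is simply "shape each piece separately", so it does not reproduce the exact token given above for the hyphenated example.

```python
import re


def word_shape(token, max_len=4):
    """Map a token to a coarse shape string (illustrative rules only)."""
    if re.fullmatch(r"\d{4}", token):
        return "YYYY"    # exactly four digits, e.g. a year
    if re.fullmatch(r"\d+", token):
        return "Dddd"    # assumption: any other all-digit token
    if re.fullmatch(r"\d+\.\d+", token):
        return "Dd.dd"   # number with a decimal point

    def piece_shape(piece):
        # Upper-case -> X, lower-case -> x, digit -> d, anything else kept.
        chars = []
        for ch in piece:
            if ch.isupper():
                chars.append("X")
            elif ch.islower():
                chars.append("x")
            elif ch.isdigit():
                chars.append("d")
            else:
                chars.append(ch)
        # Truncate so long words collapse to a small set of shapes.
        return "".join(chars)[:max_len]

    # Shape each hyphen-separated piece on its own and rejoin.
    return "-".join(piece_shape(p) for p in token.split("-"))


def replace_rare(tokens, vocab):
    """Keep in-vocabulary tokens; replace the rest with their shape."""
    return [t if t in vocab else word_shape(t) for t in tokens]
```

For example, `replace_rare(["the", "Boeing", "1984"], {"the"})` keeps "the" and turns the other two tokens into shape tokens.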

@anoopsarkar anoopsarkar modified the milestone: speed up the parser Apr 2, 2014