We should try computing accuracy on the dev set with different count cutoffs applied to the vocabulary only (i.e. not a count cutoff on features, but an initial pruning of the vocabulary based on unigram frequency). This should also give more compact models and faster training.
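A minimal sketch of the vocabulary pruning step, assuming tokenized training sentences as lists of strings. The names `build_vocab`, `prune` and the `<unk>` placeholder token are hypothetical and not taken from the existing code; the dev-set accuracy evaluation itself is left out, since it depends on the tagger.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for pruned words


def build_vocab(sentences, cutoff):
    """Keep only words whose unigram count in the training data is >= cutoff."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= cutoff}


def prune(sentences, vocab, unk=UNK):
    """Replace every out-of-vocabulary word with the placeholder token."""
    return [[word if word in vocab else unk for word in sent] for sent in sentences]


# Toy training data, for illustration only.
train = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "reprehensible", "cat"],
]

# Sweep a few cutoffs; in practice each pruned corpus would be used to
# train a model whose accuracy is then measured on the dev set.
for cutoff in (1, 2, 3):
    vocab = build_vocab(train, cutoff)
    pruned = prune(train, vocab)
    print(cutoff, len(vocab), pruned[2])
```

The dev set would be pruned with the same vocabulary built from the training data, so that words unseen in training are handled the same way as words removed by the cutoff.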
A separate idea (which we can come back to later on) is to replace uncommon words with their word shape, so that "reprehensible" would get replaced by the token "xxxx", "Boeing" by "Xxxx", a hyphenated word such as "ex-accomplice" by "Xx-xx", a sequence of 4 digits by "YYYY", a longer number by "Dddd", and a number with a decimal point by "Dd.dd". Some of these heuristics are not that useful since the information is already captured by the POS tag, but they do help in debugging these unknown word features later on.
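A rough sketch of the word-shape mapping described above. The function name `word_shape` is hypothetical, and a few choices are assumptions rather than part of the proposal: all-digit tokens shorter than 4 digits also map to "Dddd", a hyphenated word takes "Xx-xx" only when it starts with an uppercase letter (so lowercase "ex-accomplice" comes out as "xx-xx" here), and anything else falls back to the plain "xxxx"/"Xxxx" shapes.

```python
import re


def word_shape(word):
    """Map a word to a coarse shape token, following the heuristics above."""
    if re.fullmatch(r"\d{4}", word):
        return "YYYY"  # exactly 4 digits, e.g. a year
    if re.fullmatch(r"\d+", word):
        return "Dddd"  # assumption: used for any other all-digit token
    if re.fullmatch(r"\d+\.\d+", word):
        return "Dd.dd"  # number with a decimal point
    if "-" in word:
        head = word.partition("-")[0]
        # assumption: capitalization of the first part decides the shape
        return ("Xx" if head[:1].isupper() else "xx") + "-xx"
    return "Xxxx" if word[:1].isupper() else "xxxx"


def replace_rare(sentence, vocab):
    """Replace words outside the (pruned) vocabulary with their shape."""
    return [word if word in vocab else word_shape(word) for word in sentence]


print(replace_rare(["Boeing", "hired", "a", "reprehensible", "man", "in", "1987"],
                   vocab={"hired", "a", "man", "in"}))
```

Because the shape tokens are readable, the features fired for unknown words can be inspected directly when debugging.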