[ML] improve WordPiece tokenization around punctuation. #80484
Comments
Pinging @elastic/ml-core (Team:ML)
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue on Dec 2, 2021:
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes elastic#80484
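For illustration, here is a minimal sketch of the trie-based merge described in the commit message above. The class and method names are hypothetical and do not mirror the actual Elasticsearch code; it only assumes that a special token such as `[MASK]` has already been broken into pieces (`[`, `MASK`, `]`) by splitting the input on all punctuation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: names are illustrative, not the actual Elasticsearch classes.
final class SpecialTokenTrie {

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean isEndOfToken;
    }

    private final Node root = new Node();

    /**
     * Registers a special token as the sequence of pieces it breaks into when
     * the input is split on all punctuation, e.g. "[MASK]" -> ["[", "MASK", "]"].
     */
    void insert(List<String> pieces) {
        Node node = root;
        for (String piece : pieces) {
            node = node.children.computeIfAbsent(piece, p -> new Node());
        }
        node.isEndOfToken = true;
    }

    /**
     * Walks the split tokens and merges any run of pieces that matches a
     * registered special token back into a single token.
     */
    List<String> mergeSpecialTokens(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int j = i;
            int lastMatchEnd = -1;
            // Follow the trie greedily to find the longest matching run starting at i.
            while (j < tokens.size() && node.children.containsKey(tokens.get(j))) {
                node = node.children.get(tokens.get(j));
                j++;
                if (node.isEndOfToken) {
                    lastMatchEnd = j;
                }
            }
            if (lastMatchEnd > 0) {
                // Re-join the matched pieces, e.g. "[", "MASK", "]" -> "[MASK]".
                merged.add(String.join("", tokens.subList(i, lastMatchEnd)));
                i = lastMatchEnd;
            } else {
                merged.add(tokens.get(i));
                i++;
            }
        }
        return merged;
    }
}
```

With `[MASK]` inserted as the pieces `["[", "MASK", "]"]`, merging `["This", "is", "[", "MASK", "]", "~", "tastic", "!"]` yields `["This", "is", "[MASK]", "~", "tastic", "!"]`, so the special token survives even next to non-common punctuation such as `~`.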
dimitris-athanasiou added a commit that referenced this issue on Dec 6, 2021:
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes #80484
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue on Dec 6, 2021:
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes elastic#80484
elasticsearchmachine pushed a commit that referenced this issue on Dec 6, 2021:
[ML] Ensure `BertTokenizer` does not split special tokens (#81254) (#81371): This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes #80484. Follow-up: fix compilation.
In certain scenarios, our WordPiece tokenization does not handle punctuation well. For example:

`This is [MASK]~tastic!`

In this scenario (with the mask token next to non-common punctuation), the tokenizer doesn't recognize the `[MASK]` token.
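For reference, reusing the hypothetical `SpecialTokenTrie` sketch shown earlier (same package assumed), the desired behaviour for the reported input would look roughly like this; the token split shown is illustrative and happens before the WordPiece sub-word step.

```java
import java.util.List;

public class MaskTokenExample {
    public static void main(String[] args) {
        SpecialTokenTrie trie = new SpecialTokenTrie();
        // "[MASK]" splits into the pieces "[", "MASK", "]" when splitting on all punctuation.
        trie.insert(List.of("[", "MASK", "]"));

        // "This is [MASK]~tastic!" after splitting on whitespace and all punctuation.
        List<String> split = List.of("This", "is", "[", "MASK", "]", "~", "tastic", "!");

        // Prints [This, is, [MASK], ~, tastic, !] - the mask token is kept intact.
        System.out.println(trie.mergeSpecialTokens(split));
    }
}
```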