[ML] Ensure BertTokenizer does not split special tokens #81254
Conversation
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked whether there were special tokens, and then did a further split on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special token trie in order to merge matching tokens back together. Closes elastic#80484
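For illustration only, here is a minimal sketch of the trie-based merge described above; the names `SpecialTokenTrie`, `insert` and `merge` are hypothetical and do not mirror the actual `TokenTrieNode` class added in this PR.

```java
import java.util.*;

// Illustrative sketch only: a tiny trie keyed on sub-token strings. Names are
// hypothetical and do not mirror the actual TokenTrieNode implementation.
class SpecialTokenTrie {
    private final Map<String, SpecialTokenTrie> children = new HashMap<>();
    private boolean isLeaf;

    // Insert a special token pre-split on punctuation, e.g. ["[", "MASK", "]"].
    void insert(List<String> parts) {
        SpecialTokenTrie node = this;
        for (String part : parts) {
            node = node.children.computeIfAbsent(part, k -> new SpecialTokenTrie());
        }
        node.isLeaf = true;
    }

    // Merge consecutive sub-tokens that form a known special token back together.
    List<String> merge(List<String> tokens) {
        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            SpecialTokenTrie node = this;
            int matchEnd = -1;
            // Walk the trie as far as the upcoming sub-tokens allow,
            // remembering the longest complete match.
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j));
                if (node == null) {
                    break;
                }
                if (node.isLeaf) {
                    matchEnd = j;
                }
            }
            if (matchEnd >= 0) {
                output.add(String.join("", tokens.subList(i, matchEnd + 1)));
                i = matchEnd + 1;
            } else {
                output.add(tokens.get(i));
                i++;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        SpecialTokenTrie trie = new SpecialTokenTrie();
        trie.insert(List.of("[", "MASK", "]"));
        // "Hello [MASK]!" after splitting on whitespace and all punctuation:
        List<String> split = List.of("Hello", "[", "MASK", "]", "!");
        System.out.println(trie.merge(split)); // [Hello, [MASK], !]
    }
}
```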
Pinging @elastic/ml-core (Team:ML)
Good idea to use the Trie!
I left one suggestion, but LGTM.
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/TokenTrieNode.java
...gin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/BasicTokenizer.java
@@ -166,12 +166,21 @@ public void testPunctuation() {
public void testPunctuationWithMask() {
Could we verify that neversplit works on something like "this should never split"?
You mean a never-split token that contains whitespace? If yes, then it doesn't work as things stand, because never-split tokens go through the same tokenization code as the text input, and that splits on whitespace.
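To illustrate the limitation using the hypothetical sketch above (not the actual BasicTokenizer): both the never-split entry and the input are whitespace-split before the trie lookup, so a phrase containing spaces is only ever seen as separate word tokens and the merge cannot restore the original spacing.

```java
// Rough illustration using the hypothetical SpecialTokenTrie sketch above.
SpecialTokenTrie trie = new SpecialTokenTrie();
// The never-split entry is itself split the same way as the input text...
trie.insert(List.of("this", "should", "never", "split"));
// ...and the input has already been whitespace-split before the lookup,
// so the merge can only glue the words back together without the spaces:
List<String> tokens = List.of("this", "should", "never", "split");
System.out.println(trie.merge(tokens)); // [thisshouldneversplit], not "this should never split"
```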
💚 Backport successful
…) (#81371) * [ML] Ensure `BertTokenizer` does not split special tokens (#81254) * Fix compilation
…ns (elastic#81254) (elastic#81371)" This reverts commit 9fc8415.
This fixes a bug introduced by elastic#81254. We are now using a token trie to merge tokens belonging to one of the never-split tokens back together. However, if the tokenizer is lower-casing, the merged token will also be lower case and won't match never-split tokens that are expected to be upper case. This commit fixes this by looking up the original text and only merging tokens together when the original text matches one of the never-split tokens.
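A minimal sketch of the fix described above, assuming each sub-token keeps start/end offsets into the original text; `DelimitedToken` and `mergeIfNeverSplit` are hypothetical names, not the actual implementation.

```java
import java.util.*;

// Minimal sketch of the fix, assuming each sub-token keeps start/end offsets
// into the original text. Names here are hypothetical.
class NeverSplitMerge {
    record DelimitedToken(int start, int end) {}

    static Optional<String> mergeIfNeverSplit(String originalText,
                                              List<DelimitedToken> run,
                                              Set<String> neverSplit) {
        int start = run.get(0).start();
        int end = run.get(run.size() - 1).end();
        // Look up the span in the ORIGINAL text, so "[MASK]" still matches even
        // though a lower-casing tokenizer produced "[", "mask", "]" sub-tokens.
        String originalSpan = originalText.substring(start, end);
        return neverSplit.contains(originalSpan) ? Optional.of(originalSpan) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Hello [MASK]!";
        // Offsets of the lower-cased sub-tokens "[", "mask", "]" in the original text.
        List<DelimitedToken> run = List.of(
            new DelimitedToken(6, 7), new DelimitedToken(7, 11), new DelimitedToken(11, 12));
        System.out.println(mergeIfNeverSplit(text, run, Set.of("[MASK]"))); // Optional[[MASK]]
    }
}
```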