[ML] Ensure BertTokenizer does not split special tokens #81254
Conversation
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked whether there were special tokens, and then did a further split on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special token trie in order to merge matching tokens back together. Closes elastic#80484
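For illustration only, here is a minimal sketch of the trie-based merge described above; the names `SpecialTokenTrie`, `insert` and `merge` are hypothetical and do not mirror the actual `TokenTrieNode` class added in this PR.

```java
import java.util.*;

// Illustrative sketch only: a tiny trie keyed on sub-token strings. Names are
// hypothetical and do not mirror the actual TokenTrieNode implementation.
class SpecialTokenTrie {
    private final Map<String, SpecialTokenTrie> children = new HashMap<>();
    private boolean isLeaf;

    // Insert a special token pre-split on punctuation, e.g. ["[", "MASK", "]"].
    void insert(List<String> parts) {
        SpecialTokenTrie node = this;
        for (String part : parts) {
            node = node.children.computeIfAbsent(part, k -> new SpecialTokenTrie());
        }
        node.isLeaf = true;
    }

    // Merge consecutive sub-tokens that form a known special token back together.
    List<String> merge(List<String> tokens) {
        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            SpecialTokenTrie node = this;
            int matchEnd = -1;
            // Walk the trie as far as the upcoming sub-tokens allow,
            // remembering the longest complete match.
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j));
                if (node == null) {
                    break;
                }
                if (node.isLeaf) {
                    matchEnd = j;
                }
            }
            if (matchEnd >= 0) {
                output.add(String.join("", tokens.subList(i, matchEnd + 1)));
                i = matchEnd + 1;
            } else {
                output.add(tokens.get(i));
                i++;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        SpecialTokenTrie trie = new SpecialTokenTrie();
        trie.insert(List.of("[", "MASK", "]"));
        // "Hello [MASK]!" after splitting on whitespace and all punctuation:
        List<String> split = List.of("Hello", "[", "MASK", "]", "!");
        System.out.println(trie.merge(split)); // [Hello, [MASK], !]
    }
}
```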
Pinging @elastic/ml-core (Team:ML)
Good idea to use the Trie!
I left one suggestion, but LGTM.
...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/TokenTrieNode.java
...gin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/BasicTokenizer.java
@@ -166,12 +166,21 @@ public void testPunctuation() {
public void testPunctuationWithMask() {
Could we verify that neversplit works on something like "this should never split"?
You mean a never-split token that contains whitespace? If yes, then it doesn't work as things stand, because never-split tokens go through the same tokenization code as the text input, and that splits on whitespace.
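To illustrate the limitation using the hypothetical sketch above (not the actual BasicTokenizer): both the never-split entry and the input are whitespace-split before the trie lookup, so a phrase containing spaces is only ever seen as separate word tokens and the merge cannot restore the original spacing.

```java
// Rough illustration using the hypothetical SpecialTokenTrie sketch above.
SpecialTokenTrie trie = new SpecialTokenTrie();
// The never-split entry is itself split the same way as the input text...
trie.insert(List.of("this", "should", "never", "split"));
// ...and the input has already been whitespace-split before the lookup,
// so the merge can only glue the words back together without the spaces:
List<String> tokens = List.of("this", "should", "never", "split");
System.out.println(trie.merge(tokens)); // [thisshouldneversplit], not "this should never split"
```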
💚 Backport successful
…) (#81371) * [ML] Ensure `BertTokenizer` does not split special tokens (#81254) * Fix compilation
…ns (elastic#81254) (elastic#81371)" This reverts commit 9fc8415.
This fixes a bug introduced by elastic#81254. We are now using a token trie to merge tokens belonging to one of the never-split tokens back together. However, if the tokenizer is lower-casing, the merged token will also be lower case and won't match never-split tokens that are expected to be upper case. This commit fixes this by looking up the original text and only merging tokens together when the original text matches one of the never-split tokens.
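A minimal sketch of the fix described above, assuming each sub-token keeps start/end offsets into the original text; `DelimitedToken` and `mergeIfNeverSplit` are hypothetical names, not the actual implementation.

```java
import java.util.*;

// Minimal sketch of the fix, assuming each sub-token keeps start/end offsets
// into the original text. Names here are hypothetical.
class NeverSplitMerge {
    record DelimitedToken(int start, int end) {}

    static Optional<String> mergeIfNeverSplit(String originalText,
                                              List<DelimitedToken> run,
                                              Set<String> neverSplit) {
        int start = run.get(0).start();
        int end = run.get(run.size() - 1).end();
        // Look up the span in the ORIGINAL text, so "[MASK]" still matches even
        // though a lower-casing tokenizer produced "[", "mask", "]" sub-tokens.
        String originalSpan = originalText.substring(start, end);
        return neverSplit.contains(originalSpan) ? Optional.of(originalSpan) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Hello [MASK]!";
        // Offsets of the lower-cased sub-tokens "[", "mask", "]" in the original text.
        List<DelimitedToken> run = List.of(
            new DelimitedToken(6, 7), new DelimitedToken(7, 11), new DelimitedToken(11, 12));
        System.out.println(mergeIfNeverSplit(text, run, Set.of("[MASK]"))); // Optional[[MASK]]
    }
}
```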