
[ML] Ensure BertTokenizer does not split special tokens #81254

Conversation

@dimitris-athanasiou (Contributor)
This commit changes the way our Bert tokenizer preserves
special tokens without splitting them. The previous approach
first split the input on common punctuation, checked whether
there were special tokens, and then split further on all
punctuation. The problem with this was that words containing
special tokens alongside non-common punctuation were not
handled properly.

This commit addresses that by building a trie of the special
tokens, splitting the input on all punctuation, and then looking
up the tokens in the special-token trie in order to merge
matching tokens back together.

Closes #80484
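Below is a minimal, self-contained sketch of the idea, assuming a trie keyed by whole sub-tokens rather than characters. The names (`TrieNode`, `mergeSpecialTokens`) are illustrative only and are not the classes this PR actually adds:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TokenTrieDemo {

    // Trie node keyed by whole sub-tokens (the pieces left after punctuation splitting).
    static final class TrieNode {
        final Map<String, TrieNode> children = new HashMap<>();
        boolean isLeaf; // true if the path from the root spells a complete special token
    }

    // Build the trie from each special token's punctuation-split pieces,
    // e.g. "[MASK]" is inserted as ["[", "MASK", "]"].
    static TrieNode build(List<List<String>> specialTokenPieces) {
        TrieNode root = new TrieNode();
        for (List<String> pieces : specialTokenPieces) {
            TrieNode node = root;
            for (String piece : pieces) {
                node = node.children.computeIfAbsent(piece, k -> new TrieNode());
            }
            node.isLeaf = true;
        }
        return root;
    }

    // Walk the punctuation-split tokens, merging the longest run that matches
    // a special token in the trie back into a single token.
    static List<String> mergeSpecialTokens(List<String> tokens, TrieNode root) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            TrieNode node = root;
            int lastMatchEnd = -1;
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j));
                if (node == null) break;
                if (node.isLeaf) lastMatchEnd = j;
            }
            if (lastMatchEnd >= 0) {
                result.add(String.join("", tokens.subList(i, lastMatchEnd + 1)));
                i = lastMatchEnd + 1;
            } else {
                result.add(tokens.get(i++));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TrieNode trie = build(List.of(List.of("[", "MASK", "]")));
        List<String> tokens = List.of("hello", "[", "MASK", "]", "!", "world");
        System.out.println(mergeSpecialTokens(tokens, trie));
        // -> [hello, [MASK], !, world]
    }
}
```

Because the lookup happens after splitting on all punctuation, a special token survives even when it sits next to non-common punctuation, which is exactly the case the old two-pass approach mishandled.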

@elasticmachine added the Team:ML (Meta label for the ML team) label on Dec 2, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@davidkyle (Member) left a comment

Good idea to use the Trie!

I left one suggestion but otherwise LGTM

@@ -166,12 +166,21 @@ public void testPunctuation() {

public void testPunctuationWithMask() {
@davidkyle (Member)

Could we verify that neversplit works on something like "this should never split"?

@dimitris-athanasiou (Contributor, Author) Dec 6, 2021

You mean a never-split token that contains whitespace? If so, that doesn't work as things stand: never-split tokens go through the same tokenization code as the text input, and that code splits on whitespace.

@dimitris-athanasiou merged commit ab4581b into elastic:master on Dec 6, 2021
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Dec 6, 2021

@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.0

elasticsearchmachine pushed a commit that referenced this pull request Dec 6, 2021

[ML] Ensure `BertTokenizer` does not split special tokens (#81254) (#81371)


* Fix compilation
@dimitris-athanasiou deleted the preserve-never-split-tokens branch on December 6, 2021 16:26
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Dec 7, 2021
dimitris-athanasiou added a commit that referenced this pull request Dec 7, 2021
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Dec 7, 2021
dimitris-athanasiou added a commit that referenced this pull request Dec 7, 2021
This fixes a bug introduced by #81254. We now use a token trie
to merge tokens belonging to one of the never-split tokens back
together. However, if the tokenizer lower-cases its input, the
merged token will also be lower case and won't match never-split
tokens that are expected to be upper case.

This commit fixes that by looking up the original text and only
merging tokens together when the original text matches one of
the never-split tokens.
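A minimal sketch of that fix, assuming each token carries offsets into the original input; `OffsetToken` and `canMerge` are hypothetical names (using Java 16+ records), not the actual code in the follow-up commit:

```java
import java.util.List;
import java.util.Set;

final class CaseAwareMergeDemo {

    // A token plus its offsets into the original, non-lower-cased input.
    record OffsetToken(String token, int start, int end) {}

    // Only allow a run of (possibly lower-cased) tokens to be merged into a
    // never-split token when the ORIGINAL text it spans matches exactly.
    static boolean canMerge(String originalText, List<OffsetToken> run, Set<String> neverSplit) {
        int start = run.get(0).start();
        int end = run.get(run.size() - 1).end();
        return neverSplit.contains(originalText.substring(start, end));
    }

    public static void main(String[] args) {
        String original = "hello [MASK] world";
        // After lower-casing and punctuation splitting: "[", "mask", "]".
        // A trie lookup on the lower-cased tokens alone would miss "[MASK]".
        List<OffsetToken> run = List.of(
            new OffsetToken("[", 6, 7),
            new OffsetToken("mask", 7, 11),
            new OffsetToken("]", 11, 12)
        );
        System.out.println(canMerge(original, run, Set.of("[MASK]"))); // true
    }
}
```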
Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v8.0.0-rc1, v8.1.0

Projects
None yet

Development
Successfully merging this pull request may close these issues:

[ML] improve WordPiece tokenization around punctuation.

6 participants