[ML] improve WordPiece tokenization around punctuation. #80484
Comments
Pinging @elastic/ml-core (Team:ML)
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue on Dec 2, 2021:
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes elastic#80484
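For illustration, here is a minimal sketch of the trie-based merge described in the commit message above. The class and method names are hypothetical and do not mirror the actual Elasticsearch code; it only assumes that a special token such as `[MASK]` has already been broken into pieces (`[`, `MASK`, `]`) by splitting the input on all punctuation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: names are illustrative, not the actual Elasticsearch classes.
final class SpecialTokenTrie {

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean isEndOfToken;
    }

    private final Node root = new Node();

    /**
     * Registers a special token as the sequence of pieces it breaks into when
     * the input is split on all punctuation, e.g. "[MASK]" -> ["[", "MASK", "]"].
     */
    void insert(List<String> pieces) {
        Node node = root;
        for (String piece : pieces) {
            node = node.children.computeIfAbsent(piece, p -> new Node());
        }
        node.isEndOfToken = true;
    }

    /**
     * Walks the split tokens and merges any run of pieces that matches a
     * registered special token back into a single token.
     */
    List<String> mergeSpecialTokens(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int j = i;
            int lastMatchEnd = -1;
            // Follow the trie greedily to find the longest matching run starting at i.
            while (j < tokens.size() && node.children.containsKey(tokens.get(j))) {
                node = node.children.get(tokens.get(j));
                j++;
                if (node.isEndOfToken) {
                    lastMatchEnd = j;
                }
            }
            if (lastMatchEnd > 0) {
                // Re-join the matched pieces, e.g. "[", "MASK", "]" -> "[MASK]".
                merged.add(String.join("", tokens.subList(i, lastMatchEnd)));
                i = lastMatchEnd;
            } else {
                merged.add(tokens.get(i));
                i++;
            }
        }
        return merged;
    }
}
```

With `[MASK]` inserted as the pieces `["[", "MASK", "]"]`, merging `["This", "is", "[", "MASK", "]", "~", "tastic", "!"]` yields `["This", "is", "[MASK]", "~", "tastic", "!"]`, so the special token survives even next to non-common punctuation such as `~`.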
dimitris-athanasiou added a commit that referenced this issue on Dec 6, 2021:
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes #80484
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue on Dec 6, 2021:
This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes elastic#80484
elasticsearchmachine pushed a commit that referenced this issue on Dec 6, 2021:
[ML] Ensure `BertTokenizer` does not split special tokens (#81254) (#81371): This commit changes the way our Bert tokenizer preserves special tokens without splitting them. The previous approach split the tokens first on common punctuation, checked if there were special tokens, and then split further on all punctuation. The problem with this was that words containing special tokens and non-common punctuation were not handled properly. This commit addresses this by building a trie of the special tokens, splitting the input on all punctuation, and then looking up the tokens in the special-token trie in order to merge matching tokens back together. Closes #80484. Follow-up: fix compilation.
In certain scenarios, our WordPiece tokenization does not handle punctuation well. For example:

`This is [MASK]~tastic!`

In this scenario (with the mask token next to non-common punctuation), the tokenizer doesn't recognize the `[MASK]` token.
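For reference, reusing the hypothetical `SpecialTokenTrie` sketch shown earlier (same package assumed), the desired behaviour for the reported input would look roughly like this; the token split shown is illustrative and happens before the WordPiece sub-word step.

```java
import java.util.List;

public class MaskTokenExample {
    public static void main(String[] args) {
        SpecialTokenTrie trie = new SpecialTokenTrie();
        // "[MASK]" splits into the pieces "[", "MASK", "]" when splitting on all punctuation.
        trie.insert(List.of("[", "MASK", "]"));

        // "This is [MASK]~tastic!" after splitting on whitespace and all punctuation.
        List<String> split = List.of("This", "is", "[", "MASK", "]", "~", "tastic", "!");

        // Prints [This, is, [MASK], ~, tastic, !] - the mask token is kept intact.
        System.out.println(trie.mergeSpecialTokens(split));
    }
}
```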