[ML] performance improvement for precompiled normalization #87709

benwtrent · 2022-06-15T18:35:15Z

This is a performance improvement for the precompiled normalizer. It no longer requires the graphemes to be substrings; instead, it relies on code-point counts between grapheme boundaries for normalization.
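To illustrate what "code-point counts between grapheme boundaries" can look like, here is a minimal sketch. It is not the code from this PR: the class and method names are hypothetical, and it uses the JDK's `java.text.BreakIterator` as a stand-in for whatever boundary detection the normalizer actually uses. The idea is that each grapheme is described by a count of code points, so no per-grapheme substring needs to be allocated.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the PrecompiledCharMapNormalizer implementation.
public class GraphemeCodePointCounts {

    // Returns the number of code points in each grapheme of the input,
    // computed from boundary offsets rather than from extracted substrings.
    static List<Integer> codePointCounts(String text) {
        BreakIterator graphemes = BreakIterator.getCharacterInstance();
        graphemes.setText(text);
        List<Integer> counts = new ArrayList<>();
        int start = graphemes.first();
        for (int end = graphemes.next(); end != BreakIterator.DONE; start = end, end = graphemes.next()) {
            // codePointCount works on index ranges, so no substring is created here.
            counts.add(text.codePointCount(start, end));
        }
        return counts;
    }

    public static void main(String[] args) {
        // "e" followed by a combining acute accent, then "a".
        System.out.println(codePointCounts("e\u0301a"));
    }
}
```

A caller could then advance through the underlying byte or char array by these counts instead of materializing each grapheme as its own string.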

elasticmachine · 2022-07-08T17:28:59Z

Pinging @elastic/ml-core (Team:ML)

benwtrent · 2022-07-08T17:31:00Z

...n/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/PrecompiledCharMapNormalizer.java

@@ -144,7 +144,7 @@ Optional<BytesRef> normalizePart(byte[] strBytes, int offset, int len) {
            secondIndex++;
        }
        if (secondIndex == firstIndex) {
-            return Optional.empty();
+            return Optional.of(new BytesRef(BytesRef.EMPTY_BYTES));


This was a bug. We should return empty bytes here: when the two indices are equal to each other, it indicates that this part of the string should be removed (think control characters...).
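The difference between the two return values can be sketched as follows. This is a simplified illustration, not the real normalizer: `lookup` is a hypothetical stand-in for `normalizePart`, it uses `byte[]` instead of Lucene's `BytesRef`, the two mappings are made up, and it assumes the caller keeps the original text when it receives `Optional.empty()`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical illustration only; names and mappings are assumptions.
public class NormalizePartSketch {

    // Optional.empty()          -> no mapping found, caller keeps the original part.
    // Optional.of(new byte[0])  -> mapping found and the replacement is empty, caller drops the part.
    static Optional<byte[]> lookup(String part) {
        if (part.equals("\u0001")) {
            // Control character mapped to nothing: present, but zero-length.
            return Optional.of(new byte[0]);
        }
        if (part.equals("\uFF21")) {
            // Fullwidth "A" mapped to ASCII "A".
            return Optional.of("A".getBytes(StandardCharsets.UTF_8));
        }
        return Optional.empty();
    }

    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String part = new String(Character.toChars(cp));
            Optional<byte[]> mapped = lookup(part);
            if (mapped.isPresent()) {
                // An empty replacement appends nothing, which removes the part.
                out.append(new String(mapped.get(), StandardCharsets.UTF_8));
            } else {
                out.append(part);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("a\u0001b\uFF21"));
    }
}
```

Under that assumption, returning `Optional.empty()` for the control character would leave it in the output, while returning a present but empty value removes it.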

benwtrent · 2022-07-08T18:35:35Z

@elasticmachine update branch

...n/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/PrecompiledCharMapNormalizer.java

droberts195

LGTM

[ML] performance improvement for precompiled normalization

40e081e

elasticsearchmachine added the v8.4.0 label Jun 15, 2022

benwtrent added >non-issue :ml Machine learning labels Jun 21, 2022

fixing tokenization

3165403

benwtrent marked this pull request as ready for review July 8, 2022 17:28

elasticmachine added the Team:ML Meta label for the ML team label Jul 8, 2022

benwtrent commented Jul 8, 2022

revert test change

875e4bf

Merge branch 'master' into feature/ml-precompiled-normalization-perf-…

a9f7ac2

…improvement

benwtrent requested a review from droberts195 July 11, 2022 13:43

droberts195 reviewed Jul 11, 2022

address pr comments

8267595

benwtrent requested a review from droberts195 July 12, 2022 14:23

benwtrent commented Jul 13, 2022

droberts195 approved these changes Jul 13, 2022

benwtrent merged commit 9fffd78 into elastic:master Jul 13, 2022

benwtrent deleted the feature/ml-precompiled-normalization-perf-improvement branch July 13, 2022 14:06
