[ML] performance improvement for precompiled normalization #87709

@benwtrent benwtrent commented Jun 15, 2022

This is a performance improvement for the precompiled normalizer. It no longer requires the graphemes to be sub-strings; instead, it relies on code-point counts between grapheme boundaries for normalization.
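To illustrate what "code-point counts between grapheme boundaries" means, here is a minimal sketch. It is not the code from this PR: it assumes the JDK's `java.text.BreakIterator` as a stand-in for whatever grapheme iterator the normalizer uses, and the class and method names (`GraphemeCodePointCounts`, `codePointCounts`) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the PR's implementation.
class GraphemeCodePointCounts {

    // Returns, for each grapheme in the text, how many code points it spans.
    static List<Integer> codePointCounts(String text) {
        // Assumption: the JDK character-boundary iterator stands in for the real one.
        BreakIterator graphemes = BreakIterator.getCharacterInstance(Locale.ROOT);
        graphemes.setText(text);
        List<Integer> counts = new ArrayList<>();
        int start = graphemes.first();
        for (int end = graphemes.next(); end != BreakIterator.DONE; start = end, end = graphemes.next()) {
            // Count code points, not UTF-16 chars, between two boundaries.
            counts.add(text.codePointCount(start, end));
        }
        return counts;
    }

    public static void main(String[] args) {
        // 'a', then 'e' followed by a combining acute accent, then 'b'.
        System.out.println(codePointCounts("ae\u0301b"));
    }
}
```

With counts like these, a caller can step through the input by code points per grapheme without materializing each grapheme as its own sub-string.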

@benwtrent benwtrent marked this pull request as ready for review July 8, 2022 17:28
@elasticmachine elasticmachine added the Team:ML (Meta label for the ML team) label Jul 8, 2022
@elasticmachine

Pinging @elastic/ml-core (Team:ML)

@@ -144,7 +144,7 @@ Optional<BytesRef> normalizePart(byte[] strBytes, int offset, int len) {
             secondIndex++;
         }
         if (secondIndex == firstIndex) {
-            return Optional.empty();
+            return Optional.of(new BytesRef(BytesRef.EMPTY_BYTES));
Member Author

This was a bug. We should return empty bytes here: when the two indices are equal to each other, it indicates that this part of the string should be removed (think control characters...).
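The distinction between "no result" and "an empty result" can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: it uses plain `byte[]` instead of `BytesRef`, the `lookup` table is invented, and it assumes the caller treats an empty `Optional` as "no mapping, keep the original bytes".

```java
import java.io.ByteArrayOutputStream;
import java.util.Optional;

// Hypothetical illustration only; names and mappings are invented.
class EmptyReplacementSketch {

    // Optional.empty()        -> no mapping, caller keeps the original byte.
    // Optional.of(new byte[0]) -> mapped to nothing, caller drops the byte.
    static Optional<byte[]> lookup(byte b) {
        if (b >= 0 && b < 0x20) {
            return Optional.of(new byte[0]); // control character: remove it
        }
        if (b == 'A') {
            return Optional.of(new byte[] { 'a' }); // ordinary replacement
        }
        return Optional.empty(); // nothing known about this byte
    }

    static byte[] normalize(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : input) {
            Optional<byte[]> replacement = lookup(b);
            if (replacement.isPresent()) {
                byte[] bytes = replacement.get();
                out.write(bytes, 0, bytes.length); // writes nothing for empty bytes
            } else {
                out.write(b); // assumed fallback: keep the original
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] result = normalize(new byte[] { 'A', 0x07, 'b' });
        System.out.println(new String(result, java.nio.charset.StandardCharsets.US_ASCII));
    }
}
```

Under that assumption, returning `Optional.empty()` for a to-be-removed part would leave the control character in the output, which matches the bug described in the comment.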

@benwtrent

@elasticmachine update branch

@benwtrent benwtrent requested a review from droberts195 July 11, 2022 13:43
@benwtrent benwtrent requested a review from droberts195 July 12, 2022 14:23

@droberts195 droberts195 left a comment

LGTM

@benwtrent benwtrent merged commit 9fffd78 into elastic:master Jul 13, 2022
@benwtrent benwtrent deleted the feature/ml-precompiled-normalization-perf-improvement branch July 13, 2022 14:06
Labels
:ml (Machine learning), >non-issue, Team:ML (Meta label for the ML team), v8.4.0
4 participants