Better sizing BytesRef for Strings in Queries #115655
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @piergm, I've created a changelog YAML for you.
Is this fixing some existing issue?
@javanna This should delay or avoid OOMs by using less memory when creating a BytesRef.
} else if (obj instanceof CharBuffer v) {
    return BytesRefs.checkIndexableLength(new BytesRef(v));
} else if (obj instanceof BigInteger v) {
    return BytesRefs.toBytesRef(v);
can we test the change?
I thought we could not because of Java, but actually we can, and I implemented it here.
Before my change, the BytesRef could have length != bytes.length. Now the two are always equal, and the array is always <= the previous array size, which was String#length * 3.
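To make the invariant above concrete, here is a minimal JDK-only sketch. `SizedRefDemo`, `Ref`, `overAllocated`, and `exactSized` are hypothetical names introduced for illustration; `Ref` is only a stand-in for Lucene's BytesRef, and this is not the test added in the PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: Ref is a tiny stand-in for Lucene's BytesRef,
// and neither method below is the actual Lucene or Elasticsearch code.
final class SizedRefDemo {

    static final class Ref {
        final byte[] bytes; // backing array
        final int offset;   // start of the valid region
        final int length;   // number of valid bytes

        Ref(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }
    }

    // Mimics worst-case sizing: reserve 3 bytes per UTF-16 char, then fill only what is needed.
    static Ref overAllocated(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        byte[] buffer = new byte[s.length() * 3];
        System.arraycopy(encoded, 0, buffer, 0, encoded.length);
        return new Ref(buffer, 0, encoded.length);
    }

    // Exact sizing: the backing array holds the encoded bytes and nothing else.
    static Ref exactSized(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return new Ref(encoded, 0, encoded.length);
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // 5 UTF-16 chars, 6 UTF-8 bytes
        Ref over = overAllocated(s);
        Ref exact = exactSized(s);
        System.out.println("over-allocated: length=" + over.length + ", bytes.length=" + over.bytes.length);
        System.out.println("exact-sized:    length=" + exact.length + ", bytes.length=" + exact.bytes.length);
    }
}
```

A test along these lines can assert that `length == bytes.length` for the exact-sized variant, and that the exact-sized array is never larger than the over-allocated one.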
if (obj instanceof String v) {
    byte[] b = new byte[UnicodeUtil.calcUTF16toUTF8Length(v, 0, v.length())];
    UnicodeUtil.UTF16toUTF8(v, 0, v.length(), b);
    return BytesRefs.checkIndexableLength(new BytesRef(b, 0, b.length));
Could we extract these 3 lines to a separate utility method (maybe on a more appropriate class)? :) This would be very useful for saving non-trivial amounts of heap in other places!
Done, moved to BytesRefs and added Javadoc 😄
@elasticmachine update branch
LGTM :)
 * @return a BytesRef object representing the input string
 */
public static BytesRef toExactSizedBytesRef(String s) {
    byte[] b = new byte[UnicodeUtil.calcUTF16toUTF8Length(s, 0, s.length())];
NIT: could cache s.length() in a variable for a tiny speedup :P
@elasticmachine update branch
💔 Backport failed
The backport operation could not be completed due to the following error:
You can use sqren/backport to manually backport by running
* Better sizing BytesRefs for Strings in Queries
* Update docs/changelog/115655.yaml
* iter
* added test
* iter
* extracted method
* iter
---------
Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 9ebe95a)
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
* Better sizing BytesRef for Strings in Queries (#115655)
* Better sizing BytesRefs for Strings in Queries
* Update docs/changelog/115655.yaml
* iter
* added test
* iter
* extracted method
* iter
---------
Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 9ebe95a)
* iter
* Better sizing BytesRefs for Strings in Queries
* Update docs/changelog/115655.yaml
* iter
* added test
* iter
* extracted method
* iter
---------
Co-authored-by: Elastic Machine <[email protected]>
When creating a BytesRef with the standard constructor, we end up overestimating the size of the byte array (UTF8Size = UTF16Size * 3) in order to avoid scanning the input string to calculate the exact UTF-8 size from the UTF-16 input.
We now calculate the length precisely and size the byte array exactly. The conversion is slightly slower because the string is scanned to compute the length before encoding, but the result is more memory efficient.
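The trade-off can be sketched without Lucene on the classpath. The class below is a hypothetical, JDK-only stand-in: `exactUtf8Length` plays the role that `UnicodeUtil.calcUTF16toUTF8Length` plays in the PR, `worstCaseUtf8Length` mirrors the `UTF16Size * 3` bound, and the sketch assumes well-formed UTF-16 input (no unpaired surrogates). It is not the code merged in the PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, dependency-free sketch; names are illustrative, not Lucene APIs.
final class ExactUtf8LengthDemo {

    // Worst-case bound: at most 3 UTF-8 bytes per UTF-16 char.
    static int worstCaseUtf8Length(String s) {
        return s.length() * 3;
    }

    // Exact UTF-8 byte count, computed in one pass over the UTF-16 chars.
    // Assumption: the input is well-formed UTF-16.
    static int exactUtf8Length(String s) {
        final int len = s.length(); // cached once, as suggested in the review
        int bytes = 0;
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1; // ASCII
            } else if (c < 0x800) {
                bytes += 2; // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < len
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4; // surrogate pair encodes one supplementary code point
                i++;        // skip the low surrogate
            } else {
                bytes += 3; // remaining BMP chars
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String[] samples = { "abc", "h\u00e9llo", "\u20ac", "\uD83D\uDE00" };
        for (String s : samples) {
            int exact = exactUtf8Length(s);
            int worst = worstCaseUtf8Length(s);
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println("exact=" + exact + " worstCase=" + worst + " encoded=" + encoded);
        }
    }
}
```

For ASCII-only query strings, which are common, the exact array is one third the size of the worst-case one; the cost is the extra pass over the string to compute the length before encoding.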