Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better sizing BytesRef for Strings in Queries #115655

Merged
merged 11 commits into from
Nov 7, 2024
5 changes: 5 additions & 0 deletions docs/changelog/115655.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 115655
summary: Better sizing `BytesRef` for Strings in Queries
area: Search
type: enhancement
issues: []
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
import org.apache.lucene.search.NamedMatches;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.UnicodeUtil;
import org.elasticsearch.common.ParsingException;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.io.stream.StreamInput;
Expand Down Expand Up @@ -216,12 +217,14 @@ public final int hashCode() {
* @return the same input object or a {@link BytesRef} representation if input was of type string
*/
static Object maybeConvertToBytesRef(Object obj) {
if (obj instanceof String) {
return BytesRefs.checkIndexableLength(BytesRefs.toBytesRef(obj));
} else if (obj instanceof CharBuffer) {
return BytesRefs.checkIndexableLength(new BytesRef((CharBuffer) obj));
} else if (obj instanceof BigInteger) {
return BytesRefs.toBytesRef(obj);
if (obj instanceof String v) {
byte[] b = new byte[UnicodeUtil.calcUTF16toUTF8Length(v, 0, v.length())];
UnicodeUtil.UTF16toUTF8(v, 0, v.length(), b);
return BytesRefs.checkIndexableLength(new BytesRef(b, 0, b.length));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we extract these 3 lines to a separate utility method (maybe on a more appropriate class)? :) This would be very useful for saving non-trivial amounts of heap in other places!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, moved to BytesRefs and added Java Docs 😄

} else if (obj instanceof CharBuffer v) {
return BytesRefs.checkIndexableLength(new BytesRef(v));
} else if (obj instanceof BigInteger v) {
return BytesRefs.toBytesRef(v);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we test the change?

Copy link
Member Author

@piergm piergm Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though we could not because of Java, but actually we can and I implemented it here.
Before my change the BytesRef could have length != bytes.length now it's the same and consistent and always <= than previous length that was String#length*3

}
return obj;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
package org.elasticsearch.index.query;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.ParsingException;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.search.SearchModule;
Expand Down Expand Up @@ -93,4 +94,25 @@ public void testMaybeConvertToBytesRefLongTerm() {
assertThat(e.getMessage(), containsString("term starting with [aaaaa"));
}

public void testMaybeConvertToBytesRefStringCorrectSize() {
int capacity = randomIntBetween(20, 40);
StringBuilder termBuilder = new StringBuilder(capacity);
int correctSize = 0;
for (int i = 0; i < capacity; i++) {
if (i < capacity / 3) {
termBuilder.append((char) randomIntBetween(0, 128));
++correctSize; // use only one byte for char < 128
} else if (i < 2 * capacity / 3) {
termBuilder.append((char) randomIntBetween(128, 2048));
correctSize += 2; // use two bytes for char < 2048
} else {
termBuilder.append((char) randomIntBetween(2048, 4092));
correctSize += 3; // use three bytes for char >= 2048
}
}
BytesRef bytesRef = (BytesRef) AbstractQueryBuilder.maybeConvertToBytesRef(termBuilder.toString());
assertEquals(correctSize, bytesRef.bytes.length);
assertEquals(correctSize, bytesRef.length);
}

}