-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index ids in binary form. #25352
Index ids in binary form. #25352
Conversation
Neat! |
37346c8
to
c859e19
Compare
Indexing ids in binary form should help with indexing speed since we would have to compare fewer bytes upon sorting, should help with memory usage of the live version map since keys will be shorter, and might help with disk usage depending on how efficient the terms dictionary is at compressing terms. Since we can only expect base64 ids in the auto-generated case, this PR tries to use an encoding that makes the binary id equal to the base64-decoded id in the majority of cases (253 out of 256). It also specializes numeric ids, since this seems to be common when content that is stored in Elasticsearch comes from another database that uses eg. auto-increment ids. Another option could be to require base64 ids all the time. It would make things simpler but I'm not sure users would welcome this requirement. This PR should bring some benefits, but I expect it to be mostly useful when coupled with something like elastic#24615. Closes elastic#18154
c859e19
to
9d6d81d
Compare
I removed the @s1monw I'd be interested to have your opinion about the approach. |
@jpountz can you list the downsides (if there are any) of this change? |
I don't think there are significant downsides. I mostly want reviews because those are things for which handling bw compat is a PITA so I'd like to get it right as much as possible and reduce chances that we think there is a better way in the next months. This is also why I'd like to fold it into 6.0 so that we do not need to support old ids anymore as soon as 7.0. If your ID is not recognized as a numeric id or base64 id then we will prepend a byte to it which will make it one byte longer. However for numerics and base64 ids it should make things better: it should make numeric ids a bit less than 50% shorter and base64 ids about 33% shorter. I don't think it would add parse-time overhead since the cost of base64 decoding should be the same as the cost of UTF-8 encoding. However the shorter keys might help Lucene since fewer bytes need to be compared upon sorting, which might help both with flushing when we radix sort the ids, and merging when we need to sort ids coming from multiple segments on the fly using a heap. It might also make indices slightly smaller, especially those that index few fields. At the moment, there is no optimization for autogenerated ids to keep things simple, but actually we first generate a binary id then encode it as a string and then decode it again. We might be able to skip the string representation entirely hopefully at some point in the future. One downside that I do not care too much about is that the encoded representation does not preserve order, so I switched fielddata on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some minor nits
BytesRef[] bytesRefs = new BytesRef[values.size()]; | ||
for (int i = 0; i < bytesRefs.length; i++) { | ||
BytesRef id; | ||
if (context.indexVersionCreated().onOrAfter(Version.V_6_0_0_alpha3)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm maybe it's worth to move the if check out of the loop and assign it a boolean? I also wonder why the context is nullable? this looks like asking for trouble here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this is old leniency that we removed over time. I believe we reached a state that allows us to remove the nullable annotation everywhere now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
} | ||
break; | ||
default: | ||
throw new AssertionError(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please give it some text here it's always better to have a message :)
for (int i = 0; i < length; ++i) { | ||
final char c = id.charAt(i); | ||
final boolean allowed = | ||
(c >= '0' && c <= '9') || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0-9A-Za-z
being consecutive in ASCII would have made things much easier... ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhhh yes!!!
@rjernst I merged in order to have time to watch for failures before leaving, but I'm still interested in having your feedback on the encoding of ids! |
* master: (42 commits) Harden global checkpoint tracker Remove deprecated created and found from index, delete and bulk (elastic#25516) fix testEnsureVersionCompatibility for 5.5.0 release fix Version.v6_0_0 min compatibility version to 5.5.0 Add bwc indices for 5.5.0 Add v5_5_1 constant [DOCS] revise high level client Search Scroll API docs (elastic#25599) Improve REST error handling when endpoint does not support HTTP verb, add OPTIONS support (elastic#24437) Avoid SecurityException in repository-S3 on DefaultS3OutputStream.flush() (elastic#25254) [Tests] Add tests for CompletionSuggestionBuilder#build() (elastic#25575) Enable cross-setting validation [Docs] Fix typo in bootstrap-checks.asciidoc (elastic#25597) Index ids in binary form. (elastic#25352) bwc checkout should fetch from all remotes IndexingIT should check for global checkpoints regardless of master version [Tests] Add tests for PhraseSuggestionBuilder#build() (elastic#25571) Remove unused class MinimalMap (elastic#25590) [Docs] Document Scroll API for Java High Level REST Client (elastic#25554) Disable date field mapping changing (elastic#25285) Allow BWC Testing against a specific branch (elastic#25510) ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice! Sorry it took so long to look at. I left some ideas for optimization.
// The last 3 symbols (18 bits) are encoding 2 bytes (16 bits) | ||
// so the last symbol only actually uses 16-12=4 bits and can only take 16 values | ||
last = id.charAt(length - 1); | ||
if (last != 'A' && last != 'E' && last != 'I' && last != 'M' && last != 'Q'&& last != 'U'&& last != 'Y' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These conditionals (in case 2 above, and the allowed case below) could be optimized with boolean lookup tables.
for (int i = 0; i < id.length(); i += 2) { | ||
int b1 = id.charAt(i) - '0'; | ||
int b2; | ||
if (i + 1 == id.length()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could avoid this conditional by checking for an odd length before the loop, and only iterating through the loop on an even number, then have a single condition after the loop to add the last value in the odd case, followed by the end marker.
final int b1 = (b >>> 4); | ||
final int b2 = b & 0x0f; | ||
chars[(i - 1) * 2] = (char) (b1 + '0'); | ||
if (i == idBytes.length - 1 && b2 == 0x0f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could avoid this conditional by checking the last byte for the end marker before the loop, so that you only iterate over pairs of real numbers, then decode the last value in the odd case after the loop.
private static String decodeBase64Id(byte[] idBytes) { | ||
assert Byte.toUnsignedInt(idBytes[0]) <= BASE64_ESCAPE; | ||
if (Byte.toUnsignedInt(idBytes[0]) == BASE64_ESCAPE) { | ||
idBytes = Arrays.copyOfRange(idBytes, 1, idBytes.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unfortunate Base64.Encoder.encodeToString
only has a byte[]
variant and not a ByteBuffer
version that would allow avoiding the copy...
final int b = Byte.toUnsignedInt(idBytes[i]); | ||
final int b1 = (b >>> 4); | ||
final int b2 = b & 0x0f; | ||
chars[(i - 1) * 2] = (char) (b1 + '0'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe try using two 256 entry lookup tables for the upper and lower values. That would save two additions, the right shift, and cast from byte to int.
Can we do a similar approach when indexing number ids fields, on cases when we only need Either by an option in the mapping (should I open a feature request?). |
This is a first iteration that mostly aims at triggering some discussion.
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.
Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.
Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.
This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like #24615.
Many tests do not pass since they expect to find a string representation of the
id in the index.
Closes #18154