Index ids in binary form. #25352

jpountz · 2017-06-22T13:23:03Z

This is a first iteration that mostly aims at triggering some discussion.
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses e.g. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like #24615.

Many tests do not pass since they expect to find a string representation of the
id in the index.

Closes #18154

nik9000 · 2017-06-22T13:35:09Z

Neat!

jpountz · 2017-07-06T13:08:06Z

I removed the WIP label: now that we have removed the ability to have multiple types in 6.x indices, I was able to fix the backward compatibility layer and tests.

@s1monw I'd be interested to have your opinion about the approach.
@rjernst I'd like to have your review on the encoding of ids. Can you look at the approach that is taken in Uid.encodeId?

s1monw · 2017-07-07T07:16:17Z

@jpountz can you list the downsides (if there are any) of this change?

jpountz · 2017-07-07T07:34:11Z

I don't think there are significant downsides. I mostly want reviews because handling bw compat for this kind of change is a PITA, so I'd like to get it right as much as possible and reduce the chances that we find a better way in the coming months. This is also why I'd like to fold it into 6.0, so that we can stop supporting old ids as early as 7.0.

If your id is not recognized as a numeric id or a base64 id, then we prepend a byte to it, which makes it one byte longer. However, for numeric and base64 ids it should make things better: numeric ids become a bit less than 50% shorter, and base64 ids about 25% shorter (4 base64 characters decode to 3 bytes).
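A back-of-the-envelope check of these sizes, as a sketch only. It assumes the numeric layout is one marker byte followed by two decimal digits per byte, and that base64 ids use the URL-safe alphabet without padding; the class and method names are made up for this illustration and are not part of the PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class IdSizeSketch {
    /** Bytes needed for the plain UTF-8 form of an id. */
    static int utf8Length(String id) {
        return id.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Assumed numeric layout: 1 marker byte, then two decimal digits per byte. */
    static int packedNumericLength(String digits) {
        return 1 + (digits.length() + 1) / 2;
    }

    /** Length of the URL-safe base64-decoded form (4 symbols become 3 bytes). */
    static int base64DecodedLength(String id) {
        return Base64.getUrlDecoder().decode(id).length;
    }
}
```

Under these assumptions a 16-digit numeric id goes from 16 bytes to 9, and a 20-symbol base64 id goes from 20 bytes to 15.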

I don't think it would add parse-time overhead since the cost of base64 decoding should be the same as the cost of UTF-8 encoding. However the shorter keys might help Lucene since fewer bytes need to be compared upon sorting, which might help both with flushing when we radix sort the ids, and merging when we need to sort ids coming from multiple segments on the fly using a heap. It might also make indices slightly smaller, especially those that index few fields.

At the moment, there is no optimization for autogenerated ids, to keep things simple. This means that we first generate a binary id, then encode it as a string, and then decode it again. Hopefully we will be able to skip the string representation entirely at some point in the future.

One downside that I do not care too much about is that the encoded representation does not preserve order, so I switched fielddata on the _id field to expose the binary doc values API rather than the sorted one. Everything will keep working, it might just be a bit slower. I'd like to remove fielddata support on _id and _uid eventually anyway (#25240), so I don't think this is much of an issue.

s1monw

left some minor nits

s1monw · 2017-07-07T08:12:23Z

core/src/main/java/org/elasticsearch/index/mapper/IdFieldMapper.java

+                BytesRef[] bytesRefs = new BytesRef[values.size()];
+                for (int i = 0; i < bytesRefs.length; i++) {
+                    BytesRef id;
+                    if (context.indexVersionCreated().onOrAfter(Version.V_6_0_0_alpha3)) {


hmm maybe it's worth moving the if check out of the loop and assigning the result to a boolean? I also wonder why the context is nullable? this looks like asking for trouble here?

yeah this is old leniency that we removed over time. I believe we reached a state that allows us to remove the nullable annotation everywhere now

s1monw · 2017-07-07T08:14:26Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+            }
+            break;
+        default:
+            throw new AssertionError();


please give it some text here, it's always better to have a message :)
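For illustration, a sketch of what an unreachable `default` branch with a message could look like. The switch below is a hypothetical stand-in (it maps the length of an unpadded base64 string to the number of bytes carried by its trailing group); it is not the code under review.

```java
final class AssertionMessageSketch {
    /** Number of bytes encoded by the trailing group of an unpadded base64 string. */
    static int trailingBytes(int length) {
        switch (length & 0x03) {
            case 0:
                return 0; // only complete 4-symbol groups
            case 1:
                throw new IllegalArgumentException(
                        "a single trailing base64 symbol cannot encode a byte: length=" + length);
            case 2:
                return 1; // 2 symbols (12 bits) carry 1 byte
            case 3:
                return 2; // 3 symbols (18 bits) carry 2 bytes
            default:
                // Unreachable: (length & 0x03) is always in [0, 3]. The message
                // makes the failure self-explanatory if that ever stops holding.
                throw new AssertionError("Impossible case: length & 0x03 = " + (length & 0x03));
        }
    }
}
```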

s1monw · 2017-07-07T08:17:55Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+        for (int i = 0; i < length; ++i) {
+            final char c = id.charAt(i);
+            final boolean allowed =
+                    (c >= '0' && c <= '9') ||


0-9A-Za-z being consecutive in ASCII would have made things much easier... ;)

ohhhh yes!!!

jpountz · 2017-07-07T12:23:24Z

@rjernst I merged in order to have time to watch for failures before leaving, but I'm still interested in having your feedback on the encoding of ids!

rjernst

This is nice! Sorry it took so long to look at. I left some ideas for optimization.

rjernst · 2017-07-12T20:32:48Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+            // The last 3 symbols (18 bits) are encoding 2 bytes (16 bits)
+            // so the last symbol only actually uses 16-12=4 bits and can only take 16 values
+            last = id.charAt(length - 1);
+            if (last != 'A' && last != 'E' && last != 'I' && last != 'M' && last != 'Q'&& last != 'U'&& last != 'Y'


These conditionals (in case 2 above, and the allowed case below) could be optimized with boolean lookup tables.
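A minimal sketch of the lookup-table idea, assuming ids use the URL-safe base64 alphabet (an assumption, the alphabet is not shown in the quoted hunk). The tables are built from the alphabet string instead of being typed out by hand, and all names are hypothetical.

```java
final class Base64LookupSketch {
    /** Assumed alphabet: URL-safe base64, in value order 0..63. */
    private static final String URL_SAFE_ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    /** ALLOWED[c] is true when c is a symbol of the alphabet. */
    private static final boolean[] ALLOWED = new boolean[128];

    /**
     * LAST_OF_THREE[c] is true when c may end a 3-symbol group: the group
     * carries 16 bits in 18, so the 2 low bits of the last symbol must be 0.
     */
    private static final boolean[] LAST_OF_THREE = new boolean[128];

    static {
        for (int value = 0; value < URL_SAFE_ALPHABET.length(); value++) {
            char symbol = URL_SAFE_ALPHABET.charAt(value);
            ALLOWED[symbol] = true;
            LAST_OF_THREE[symbol] = (value & 0x03) == 0;
        }
    }

    static boolean isAllowed(char c) {
        // The range check keeps non-ASCII characters from indexing past the table.
        return c < 128 && ALLOWED[c];
    }

    static boolean canEndThreeSymbolGroup(char c) {
        return c < 128 && LAST_OF_THREE[c];
    }
}
```

Each check becomes one bounds test plus one array load instead of a chain of comparisons. Whether that is measurably faster here would need a benchmark.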

rjernst · 2017-07-12T20:35:33Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+        for (int i = 0; i < id.length(); i += 2) {
+            int b1 = id.charAt(i) - '0';
+            int b2;
+            if (i + 1 == id.length()) {


You could avoid this conditional by checking for an odd length before the loop, and only iterating through the loop on an even number, then have a single condition after the loop to add the last value in the odd case, followed by the end marker.
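A sketch of that loop shape, under stated assumptions: the input contains only ASCII digits, the encoded form starts with one marker byte, and an odd trailing digit is padded with the `0x0f` end marker seen in the decode hunk. The marker value `0xff` and all names are placeholders chosen for this example.

```java
final class NumericIdEncodeSketch {
    /** Assumed marker byte for numeric ids; the real constant is not shown in this thread. */
    static final int NUMERIC_MARKER = 0xff;
    /** Low-nibble filler that flags an odd number of digits. */
    static final int END_MARKER = 0x0f;

    static byte[] encode(String digits) {
        final int length = digits.length();
        final int pairedLength = length & ~1; // largest even number <= length
        final byte[] out = new byte[1 + (length + 1) / 2];
        out[0] = (byte) NUMERIC_MARKER;
        int o = 1;
        // The loop only ever sees complete pairs, so it needs no branch.
        for (int i = 0; i < pairedLength; i += 2) {
            final int high = digits.charAt(i) - '0';
            final int low = digits.charAt(i + 1) - '0';
            out[o++] = (byte) ((high << 4) | low);
        }
        // Single check after the loop for the odd trailing digit.
        if (pairedLength != length) {
            final int high = digits.charAt(length - 1) - '0';
            out[o] = (byte) ((high << 4) | END_MARKER);
        }
        return out;
    }
}
```

With these placeholder constants, `encode("123")` yields the bytes `ff 12 3f` and `encode("1234")` yields `ff 12 34`.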

rjernst · 2017-07-12T20:38:42Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+            final int b1 = (b >>> 4);
+            final int b2 = b & 0x0f;
+            chars[(i - 1) * 2] = (char) (b1 + '0');
+            if (i == idBytes.length - 1 && b2 == 0x0f) {


You could avoid this conditional by checking the last byte for the end marker before the loop, so that you only iterate over pairs of real numbers, then decode the last value in the odd case after the loop.
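The decoding side of the same idea could be sketched as follows, assuming the layout visible in the hunk: index 0 holds a marker byte, each following byte holds two digits, and a low nibble of `0x0f` in the last byte marks an odd digit count. Names are hypothetical.

```java
final class NumericIdDecodeSketch {
    /** Low-nibble value that marks a missing second digit in the last byte. */
    static final int END_MARKER = 0x0f;

    static String decode(byte[] idBytes) {
        // idBytes[0] is the marker byte; digits start at index 1.
        final int last = idBytes.length - 1;
        // Look at the last byte once, before the loop.
        final boolean odd = last >= 1 && (idBytes[last] & 0x0f) == END_MARKER;
        final int fullBytes = odd ? last - 1 : last; // bytes holding two real digits
        final char[] chars = new char[fullBytes * 2 + (odd ? 1 : 0)];
        for (int i = 1; i <= fullBytes; i++) {
            final int b = Byte.toUnsignedInt(idBytes[i]);
            chars[(i - 1) * 2] = (char) ((b >>> 4) + '0');
            chars[(i - 1) * 2 + 1] = (char) ((b & 0x0f) + '0');
        }
        if (odd) {
            // Only the high nibble of the last byte is a real digit.
            chars[chars.length - 1] = (char) ((Byte.toUnsignedInt(idBytes[last]) >>> 4) + '0');
        }
        return new String(chars);
    }
}
```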

rjernst · 2017-07-12T20:40:34Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+    private static String decodeBase64Id(byte[] idBytes) {
+        assert Byte.toUnsignedInt(idBytes[0]) <= BASE64_ESCAPE;
+        if (Byte.toUnsignedInt(idBytes[0]) == BASE64_ESCAPE) {
+            idBytes = Arrays.copyOfRange(idBytes, 1, idBytes.length);


It is unfortunate Base64.Encoder.encodeToString only has a byte[] variant and not a ByteBuffer version that would allow avoiding the copy...
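For what it's worth, `Base64.Encoder` does have an `encode(ByteBuffer)` overload, which can read from an offset without copying the input. It returns a `ByteBuffer` rather than a `String`, so one more step is needed. A sketch, assuming the ids use the unpadded URL-safe encoder (an assumption) and with hypothetical names; this has not been benchmarked against the `copyOfRange` version.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64NoCopySketch {
    /** Encodes idBytes[offset..] as unpadded URL-safe base64 without copying the input. */
    static String encodeFrom(byte[] idBytes, int offset) {
        // wrap() creates a view over the existing array; no bytes are copied.
        final ByteBuffer input = ByteBuffer.wrap(idBytes, offset, idBytes.length - offset);
        final ByteBuffer encoded = Base64.getUrlEncoder().withoutPadding().encode(input);
        // Base64 output is pure ASCII, so a single-byte charset decodes it losslessly.
        return StandardCharsets.ISO_8859_1.decode(encoded).toString();
    }
}
```

The trade-off is that the output now goes through an intermediate buffer before becoming a `String`, so the saving on the input side may be cancelled out.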

rjernst · 2017-07-12T20:44:03Z

core/src/main/java/org/elasticsearch/index/mapper/Uid.java

+            final int b = Byte.toUnsignedInt(idBytes[i]);
+            final int b1 = (b >>> 4);
+            final int b2 = b & 0x0f;
+            chars[(i - 1) * 2] = (char) (b1 + '0');


Maybe try using two 256-entry lookup tables for the upper and lower values. That would save two additions, the right shift, and the cast from byte to int.
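A sketch of the two-table variant, with hypothetical names. One caveat: Java bytes are signed, so indexing a 256-entry table still needs a mask (or an offset) to turn the byte into a non-negative index.

```java
final class NibbleTableSketch {
    /** HIGH[b] is the character for the upper nibble of the unsigned byte value b. */
    private static final char[] HIGH = new char[256];
    /** LOW[b] is the character for the lower nibble of the unsigned byte value b. */
    private static final char[] LOW = new char[256];

    static {
        for (int b = 0; b < 256; b++) {
            HIGH[b] = (char) ((b >>> 4) + '0');
            LOW[b] = (char) ((b & 0x0f) + '0');
        }
    }

    static char high(byte b) {
        return HIGH[b & 0xff]; // mask maps the signed byte to 0..255
    }

    static char low(byte b) {
        return LOW[b & 0xff];
    }
}
```

The shift and the additions move into the static initializer, leaving a mask and an array load per nibble. A low nibble of `0x0f` (the end marker) maps to a non-digit character, so the caller still has to handle the last byte separately.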

ddorian · 2018-05-05T09:49:03Z

Can we take a similar approach when indexing numeric id fields, in cases where we only need the equality operator and store them as keyword instead of numeric?

Either via an option in the mapping (should I open a feature request?),
or in the client (should I just base64-encode them?).

jpountz added >enhancement review WIP labels Jun 22, 2017

jpountz force-pushed the feature/binary_ids branch from 37346c8 to c859e19 Compare July 6, 2017 09:48

jpountz force-pushed the feature/binary_ids branch from c859e19 to 9d6d81d Compare July 6, 2017 12:09

jpountz added v6.0.0 and removed WIP labels Jul 6, 2017

jpountz requested review from s1monw and rjernst July 6, 2017 13:08

s1monw approved these changes Jul 7, 2017

iter

3615ce6

jpountz added the :Core/Infra/Core Core issues without another label label Jul 7, 2017

jpountz merged commit 40bb166 into elastic:master Jul 7, 2017

jpountz deleted the feature/binary_ids branch July 7, 2017 12:22

jpountz mentioned this pull request Jul 7, 2017

Optimize the order of bytes in uuids for better compression. #24615

Merged

rjernst reviewed Jul 12, 2017

clintongormley added v6.0.0-beta1 and removed v6.0.0 labels Jul 25, 2017

jpountz mentioned this pull request Jul 31, 2017

Number _id on ES 6 with single-type indexes #25957

Closed
