Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index ids in binary form. #25352

Merged
merged 2 commits into from
Jul 7, 2017
Merged

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jun 22, 2017

This is a first iteration that mostly aims at triggering some discussion.
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like #24615.

Many tests do not pass since they expect to find a string representation of the
id in the index.

Closes #18154

@nik9000
Copy link
Member

nik9000 commented Jun 22, 2017

Neat!

@jpountz jpountz force-pushed the feature/binary_ids branch from 37346c8 to c859e19 Compare July 6, 2017 09:48
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like elastic#24615.

Closes elastic#18154
@jpountz jpountz force-pushed the feature/binary_ids branch from c859e19 to 9d6d81d Compare July 6, 2017 12:09
@jpountz jpountz added v6.0.0 and removed WIP labels Jul 6, 2017
@jpountz
Copy link
Contributor Author

jpountz commented Jul 6, 2017

I removed the WIP label, now that we removed the ability to have multiple types in 6.x indices, I was able to fix the backward compatibility layer and tests.

@s1monw I'd be interested to have your opinion about the approach.
@rjernst I'd like to have your review on the encoding of ids. Can you look at the approach that is taken in Uid.encodeId?

@jpountz jpountz requested review from s1monw and rjernst July 6, 2017 13:08
@s1monw
Copy link
Contributor

s1monw commented Jul 7, 2017

@jpountz can you list the downsides (if there are any) of this change?

@jpountz
Copy link
Contributor Author

jpountz commented Jul 7, 2017

I don't think there are significant downsides. I mostly want reviews because those are things for which handling bw compat is a PITA so I'd like to get it right as much as possible and reduce chances that we think there is a better way in the next months. This is also why I'd like to fold it into 6.0 so that we do not need to support old ids anymore as soon as 7.0.

If your ID is not recognized as a numeric id or base64 id then we will prepend a byte to it which will make it one byte longer. However for numerics and base64 ids it should make things better: it should make numeric ids a bit less than 50% shorter and base64 ids about 33% shorter.

I don't think it would add parse-time overhead since the cost of base64 decoding should be the same as the cost of UTF-8 encoding. However the shorter keys might help Lucene since fewer bytes need to be compared upon sorting, which might help both with flushing when we radix sort the ids, and merging when we need to sort ids coming from multiple segments on the fly using a heap. It might also make indices slightly smaller, especially those that index few fields.

At the moment, there is no optimization for autogenerated ids to keep things simple, but actually we first generate a binary id then encode it as a string and then decode it again. We might be able to skip the string representation entirely hopefully at some point in the future.

One downside that I do not care too much about is that the encoded representation does not preserve order, so I switched fielddata on the _id field to expose the binary doc values API rather than sorted. Everything will keep working, it might just be a bit slower. I'd like to remove fielddata support on _id and _uid eventually anyway (#25240) so I don't think this is much of an issue.

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some minor nits

BytesRef[] bytesRefs = new BytesRef[values.size()];
for (int i = 0; i < bytesRefs.length; i++) {
BytesRef id;
if (context.indexVersionCreated().onOrAfter(Version.V_6_0_0_alpha3)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe it's worth to move the if check out of the loop and assign it a boolean? I also wonder why the context is nullable? this looks like asking for trouble here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is old leniency that we removed over time. I believe we reached a state that allows us to remove the nullable annotation everywhere now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

}
break;
default:
throw new AssertionError();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please give it some text here it's always better to have a message :)

for (int i = 0; i < length; ++i) {
final char c = id.charAt(i);
final boolean allowed =
(c >= '0' && c <= '9') ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0-9A-Za-z being consecutive in ASCII would have made things much easier... ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhhh yes!!!

@jpountz jpountz added the :Core/Infra/Core Core issues without another label label Jul 7, 2017
@jpountz jpountz merged commit 40bb166 into elastic:master Jul 7, 2017
@jpountz jpountz deleted the feature/binary_ids branch July 7, 2017 12:22
@jpountz
Copy link
Contributor Author

jpountz commented Jul 7, 2017

@rjernst I merged in order to have time to watch for failures before leaving, but I'm still interested in having your feedback on the encoding of ids!

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 7, 2017
* master: (42 commits)
  Harden global checkpoint tracker
  Remove deprecated created and found from index, delete and bulk (elastic#25516)
  fix testEnsureVersionCompatibility for 5.5.0 release
  fix Version.v6_0_0 min compatibility version to 5.5.0
  Add bwc indices for 5.5.0
  Add v5_5_1 constant
  [DOCS] revise high level client Search Scroll API docs (elastic#25599)
  Improve REST error handling when endpoint does not support HTTP verb, add OPTIONS support (elastic#24437)
  Avoid SecurityException in repository-S3 on DefaultS3OutputStream.flush() (elastic#25254)
  [Tests] Add tests for CompletionSuggestionBuilder#build() (elastic#25575)
  Enable cross-setting validation
  [Docs] Fix typo in bootstrap-checks.asciidoc (elastic#25597)
  Index ids in binary form. (elastic#25352)
  bwc checkout should fetch from all remotes
  IndexingIT should check for global checkpoints regardless of master version
  [Tests] Add tests for PhraseSuggestionBuilder#build() (elastic#25571)
  Remove unused class MinimalMap (elastic#25590)
  [Docs] Document Scroll API for Java High Level REST Client (elastic#25554)
  Disable date field mapping changing (elastic#25285)
  Allow BWC Testing against a specific branch (elastic#25510)
  ...
Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! Sorry it took so long to look at. I left some ideas for optimization.

// The last 3 symbols (18 bits) are encoding 2 bytes (16 bits)
// so the last symbol only actually uses 16-12=4 bits and can only take 16 values
last = id.charAt(length - 1);
if (last != 'A' && last != 'E' && last != 'I' && last != 'M' && last != 'Q'&& last != 'U'&& last != 'Y'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These conditionals (in case 2 above, and the allowed case below) could be optimized with boolean lookup tables.

for (int i = 0; i < id.length(); i += 2) {
int b1 = id.charAt(i) - '0';
int b2;
if (i + 1 == id.length()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could avoid this conditional by checking for an odd length before the loop, and only iterating through the loop on an even number, then have a single condition after the loop to add the last value in the odd case, followed by the end marker.

final int b1 = (b >>> 4);
final int b2 = b & 0x0f;
chars[(i - 1) * 2] = (char) (b1 + '0');
if (i == idBytes.length - 1 && b2 == 0x0f) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could avoid this conditional by checking the last byte for the end marker before the loop, so that you only iterate over pairs of real numbers, then decode the last value in the odd case after the loop.

private static String decodeBase64Id(byte[] idBytes) {
assert Byte.toUnsignedInt(idBytes[0]) <= BASE64_ESCAPE;
if (Byte.toUnsignedInt(idBytes[0]) == BASE64_ESCAPE) {
idBytes = Arrays.copyOfRange(idBytes, 1, idBytes.length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate Base64.Encoder.encodeToString only has a byte[] variant and not a ByteBuffer version that would allow avoiding the copy...

final int b = Byte.toUnsignedInt(idBytes[i]);
final int b1 = (b >>> 4);
final int b2 = b & 0x0f;
chars[(i - 1) * 2] = (char) (b1 + '0');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe try using two 256 entry lookup tables for the upper and lower values. That would save two additions, the right shift, and cast from byte to int.

@ddorian
Copy link

ddorian commented May 5, 2018

Can we do a similar approach when indexing number ids fields, on cases when we only need equality operator and we're storing as keyword instead of numeric ?

Either by an option in the mapping (should I open a feature request?).
Or in the client (should I just base64?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >enhancement v6.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants