Consider storing _id through doc values. #60778

jtibshirani · 2020-08-05T17:48:40Z

I wanted to revisit the idea of storing _id as a doc value field. To avoid duplicating data, we would also stop storing _id as a stored field. During the fetch phase, _id would be retrieved from doc values instead of stored fields as it is now.

We previously discussed this in #11887, but the trade-offs may be different now that we have compression for binary doc values. @jpountz recently ran an experiment that showed switching _id from a stored to binary doc value field didn't increase index size.

Some advantages to having doc values for _id:

Sorting and aggregating on _id would work without loading on-heap 'fielddata'.
Some search workflows retrieve a large number of docs and only consult the _id, but do not load detailed data like _source. These searches could be more efficient, since we would no longer decompress all stored fields (including _source) just to retrieve _id.
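
To make the second point concrete, here is a toy model of the two layouts. It is not Lucene's actual on-disk format: the block size, the JSON encoding, and every function name below are assumptions chosen only for illustration. It shows that a row-oriented stored-fields block must be decompressed in full (including _source) to read one _id, while a column that holds only ids can be read directly.

```python
import json
import zlib

# Hypothetical documents: a small _id next to a much larger _source.
docs = [
    {"_id": f"doc-{i}", "_source": {"body": "lorem ipsum " * 50, "n": i}}
    for i in range(100)
]

BLOCK_SIZE = 16  # assumed number of documents per compressed block


def build_stored_blocks(documents, block_size=BLOCK_SIZE):
    """Row-oriented layout: whole documents are compressed together in blocks."""
    blocks = []
    for start in range(0, len(documents), block_size):
        chunk = documents[start:start + block_size]
        blocks.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
    return blocks


def id_from_stored(blocks, doc_num, block_size=BLOCK_SIZE):
    """Reading one _id decompresses the entire block, _source included."""
    block = json.loads(zlib.decompress(blocks[doc_num // block_size]))
    return block[doc_num % block_size]["_id"]


def build_id_column(documents):
    """Column-oriented layout: the ids live on their own, apart from _source."""
    return [d["_id"].encode("utf-8") for d in documents]


def id_from_column(column, doc_num):
    """Reading one _id touches only the id column."""
    return column[doc_num].decode("utf-8")


blocks = build_stored_blocks(docs)
column = build_id_column(docs)
```

Both lookups return the same id for a given document number; the difference is how much unrelated data has to be decompressed along the way.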

One question I have is whether sorting on _id would still be useful after we introduce search contexts, with a built-in tiebreaker (#56828). And is there a use case for aggregating on _id?

elasticmachine · 2020-08-05T17:48:42Z

Pinging @elastic/es-search (:Search/Search)

jpountz · 2020-08-22T06:22:56Z

There may be connections with #48699. If we dropped ids from indices it would be problematic to rely on them for tie-breaking. On the other hand, removing ids from an index would be more efficient if they were stored in a doc-value field as it would avoid having to decompress and then compress again all stored fields.

jtibshirani · 2020-11-02T23:23:28Z

Some notes from our team discussion:

We reiterated that there don't seem to be strong use cases for sorting/aggregating on _id, especially after we introduce a dedicated tiebreaker. We'd like to completely remove support for fielddata on _id (Remove support for fielddata loading on _id. #64511), with no dependency on this issue.
Being able to load _id without decompressing _source may not be a compelling enough reason to make the change.

I'll leave this open for a bit longer, but will close if there's no more interest or feedback.

elasticmachine · 2020-11-02T23:23:43Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

davishmcclurg · 2020-11-02T23:38:40Z

I just had to make a change in enterprise search related to this. We're working on changing how we store documents and I was hoping to not have to copy _id into another field. We allow sorting on ids, though, which caused errors in Elasticsearch 8 when sorting on _id.

jtibshirani · 2020-11-03T00:02:23Z

@davishmcclurg this is helpful feedback. For better context, could you describe some use cases where users want to sort on the _id field?

davishmcclurg · 2020-11-03T01:13:53Z

For better context, could you describe some use cases where users want to sort on the _id field?

Internally, we sort on ID (using search_after) to scan through documents. Our users might also find it useful to have a stable sort field to page through the documents they've indexed in app/workplace search. I don't know that we purposefully exposed the ability to sort by ID, but we still need it for backwards compatibility.
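
The scan described above can be sketched roughly as follows. This is a sketch under assumptions, not the enterprise search implementation: `build_page_request`, `scan_ids`, and `fake_search` are hypothetical names, and `fake_search` is an in-memory stand-in for a real search call so the example runs on its own. The request fields (`size`, `sort`, `search_after`) and the per-hit `sort` array in the response follow the Elasticsearch search API.

```python
from typing import Callable, Iterator, Optional


def build_page_request(page_size: int, sort_field: str, after: Optional[list]) -> dict:
    """Build one page of a search_after scan ordered by a unique field."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{sort_field: "asc"}],
        "_source": False,  # only the ids are needed
    }
    if after is not None:
        # Resume strictly after the sort values of the last hit seen.
        body["search_after"] = after
    return body


def scan_ids(search: Callable[[dict], dict], page_size: int = 2,
             sort_field: str = "_id") -> Iterator[str]:
    """Yield every document id by paging with search_after until a page is empty."""
    after = None
    while True:
        response = search(build_page_request(page_size, sort_field, after))
        hits = response["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield hit["_id"]
        after = hits[-1]["sort"]


def fake_search(body: dict) -> dict:
    """In-memory stand-in for a search call (assumption, for illustration only)."""
    ids = ["a", "b", "c", "d", "e"]
    after = body.get("search_after")
    remaining = [i for i in ids if after is None or i > after[0]]
    page = remaining[: body["size"]]
    return {"hits": {"hits": [{"_id": i, "sort": [i]} for i in page]}}
```

The sort field has to be unique per document for the paging to be stable, which is why _id is attractive here. Passing a different `sort_field` would be the workaround if _id itself cannot be sorted on.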

etki · 2020-12-26T15:27:27Z

Just adding a necrocomment: from what I remember, this has already been discussed, and the reasoning behind not adding doc_values was that the cardinality is as high as the number of documents, while doc_values expect something more selective.

This may be currently a complete lie since that discussion was from years ago and my memory isn't 100% reliable.

nik9000 · 2022-05-04T17:21:17Z

I talked to @jtibshirani and she's ok closing this as something that we're not going to have time to do soon. So I'll do that. But! TSDB (#74660) is not going to store _id at all at some point. It'll reconstruct the _id on the fly from the much lower cardinality _tsid and @timestamp fields. They still have fairly high cardinality, but much much much lower than _id. We don't have plans to do anything like this in the short term for non-tsdb indices.
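
As a rough illustration of why an id that is a pure function of other fields does not need to be stored, here is a toy derivation. It is explicitly not the encoding TSDB uses: the hash choice, byte layout, and the name `synthetic_id` are all assumptions.

```python
import base64
import hashlib
import struct


def synthetic_id(tsid: bytes, timestamp_millis: int) -> str:
    """Derive a deterministic id from a time series id and a timestamp.

    Toy scheme (assumption): 8 bytes of a SHA-256 digest of the tsid,
    followed by the timestamp as a big-endian signed 64-bit integer,
    encoded as URL-safe base64 without padding.
    """
    digest = hashlib.sha256(tsid).digest()[:8]
    packed = digest + struct.pack(">q", timestamp_millis)
    return base64.urlsafe_b64encode(packed).decode("ascii").rstrip("=")
```

Because the same inputs always produce the same id, the id can be recomputed at fetch time from the two fields instead of being read back from the index.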

mitar · 2022-10-04T19:53:09Z

I am sad to see this being closed. My use cases (in the same system) are exactly the ones described above:

I care only about the search result _id and no other field. Our system returns that to the client, and then the client fetches documents through other means (one of the reasons for this is that indexing for Elasticsearch generally requires some transformations to the document to be able to map fields better, while our original document structure is easier to work with elsewhere). I currently use _source: false in search queries to suppress sending _source over the wire, but if there are ways to make ES not even load _source internally (just to get _id), that is even better.
I am also interested in using _id for sorting, but just for breaking ties when the score is equal between multiple documents. I want to predictably but randomly sort such results. So I am using the random_score scoring function on the _id field with a random seed tied to the user (so that a user has the same ordering even if they redo the query). This is being deprecated, so having an alternative for this would be great. (Without having to copy the _id just to support this edge case.)
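
For reference, the two request shapes described above might look like the sketch below, written as plain Python dicts that mirror the JSON search body. The helper names and the CRC32-based seed are assumptions; `_source: false` and `function_score` with `random_score` (`seed`, `field`) are existing parts of the query DSL. The default field here is `_seq_no`, which to my understanding is what the Elasticsearch documentation suggests in place of `_id`, with the caveat that the ordering changes when a document is updated. How the random score is combined with the relevance score is left at the defaults, since the thread does not say.

```python
import zlib


def ids_only_body(query: dict, size: int = 10) -> dict:
    """Search body that asks for hits without their _source."""
    return {"size": size, "query": query, "_source": False}


def seed_for_user(user_id: str) -> int:
    """Stable per-user seed (hypothetical scheme: CRC32 of the user id)."""
    return zlib.crc32(user_id.encode("utf-8"))


def seeded_random_query(query: dict, seed: int, field: str = "_seq_no") -> dict:
    """Wrap a query in function_score with a seeded random_score."""
    return {
        "function_score": {
            "query": query,
            "random_score": {"seed": seed, "field": field},
        }
    }


# Example: the same user always gets the same seed, hence the same ordering.
body = ids_only_body(
    seeded_random_query({"match": {"title": "doc values"}}, seed_for_user("alice"))
)
```

Passing `field="_id"` reproduces the setup described in the comment, which is the variant that depends on fielddata for _id.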

mitar · 2022-10-04T20:33:24Z

One option could be that one could optionally enable doc_values on _id if you wanted to do so and sort on it.

jtibshirani added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Aug 5, 2020

elasticmachine added the Team:Search Meta label for search team label Aug 5, 2020

jtibshirani added the team-discuss label Aug 11, 2020

jtibshirani removed the team-discuss label Oct 30, 2020

jtibshirani added the :Analytics/Aggregations Aggregations label Nov 2, 2020

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 2, 2020

jtibshirani mentioned this issue Nov 4, 2020

Remove on-heap fielddata. #64612

Closed

wchaparro added the team-discuss label Mar 21, 2022

nik9000 closed this as completed May 4, 2022

mitar mentioned this issue Oct 4, 2022

Remove support for fielddata loading on _id. #64511

Open

weltenwort mentioned this issue Aug 9, 2023

[OnWeek][Discover] Allow to fetch more documents on Discover page elastic/kibana#157241

Closed
