Consider storing _id through doc values. #60778

Closed
jtibshirani opened this issue Aug 5, 2020 · 11 comments
Labels
:Analytics/Aggregations Aggregations >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Meta label for search team team-discuss

Comments

@jtibshirani
Contributor

jtibshirani commented Aug 5, 2020

I wanted to revisit the idea of storing _id as a doc value field. To avoid duplicating data, we would also stop storing _id as a stored field. During the fetch phase, _id would be retrieved from doc values instead of stored fields as it is now.

We previously discussed this in #11887, but the trade-offs may be different now that we have compression for binary doc values. @jpountz recently ran an experiment that showed switching _id from a stored to binary doc value field didn't increase index size.

Some advantages to having doc values for _id:

  • Sorting and aggregating on _id would work without loading on-heap 'fielddata'.
  • Some search workflows retrieve a large number of docs and only consult the _id, but do not load detailed data like _source. These searches could be more efficient, since we would no longer decompress all stored fields (including _source) just to retrieve _id.
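The second point can be illustrated with a toy model. This is only a sketch: it uses zlib and JSON as stand-ins and does not reflect how Lucene actually lays out stored fields or doc values. The names `stored_block`, `id_column`, and both helper functions are hypothetical. It shows that when _id is compressed together with _source, reading the ids means decompressing everything, whereas a separate column only needs the ids themselves.

```python
import json
import zlib

# Toy data: each document has a short _id and a much larger _source.
docs = [
    {"_id": f"doc-{i}", "_source": {"body": "lorem ipsum " * 50, "n": i}}
    for i in range(100)
]

# Simplified "stored fields" layout (assumption): _id and _source are
# serialized together and compressed as a single block.
stored_block = zlib.compress(json.dumps(docs).encode("utf-8"))

# Simplified "doc values" layout (assumption): the _id column is kept
# and compressed on its own, separate from _source.
id_column = zlib.compress("\n".join(d["_id"] for d in docs).encode("utf-8"))


def ids_from_stored_fields(block):
    """Return (ids, bytes decompressed) when ids live next to _source."""
    raw = zlib.decompress(block)
    return [d["_id"] for d in json.loads(raw)], len(raw)


def ids_from_doc_values(column):
    """Return (ids, bytes decompressed) when ids live in their own column."""
    raw = zlib.decompress(column)
    return raw.decode("utf-8").split("\n"), len(raw)


stored_ids, stored_bytes = ids_from_stored_fields(stored_block)
column_ids, column_bytes = ids_from_doc_values(id_column)
print(f"stored fields: {stored_bytes} bytes decompressed")
print(f"doc values:    {column_bytes} bytes decompressed")
```

Both paths return the same ids; the difference is only in how much data has to be decompressed to get them.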

One question I have is whether sorting on _id would still be useful after we introduce search contexts, with a built-in tiebreaker (#56828). And is there a use case for aggregating on _id?

@jtibshirani jtibshirani added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Aug 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Aug 5, 2020
@jpountz
Contributor

jpountz commented Aug 22, 2020

There may be connections with #48699. If we dropped ids from indices it would be problematic to rely on them for tie-breaking. On the other hand, removing ids from an index would be more efficient if they were stored in a doc-value field as it would avoid having to decompress and then compress again all stored fields.

@jtibshirani
Contributor Author

Some notes from our team discussion:

  • We reiterated that there don't seem to be strong use cases for sorting or aggregating on _id, especially after we introduce a dedicated tiebreaker. We'd like to completely remove support for fielddata on _id (Remove support for fielddata loading on _id. #64511), with no dependency on this issue.
  • Being able to load _id without decompressing _source may not be a compelling enough reason to make the change.

I'll leave this open for a bit longer, but will close if there's no more interest or feedback.

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 2, 2020
@davishmcclurg

I just had to make a change in enterprise search related to this. We're working on changing how we store documents and I was hoping to not have to copy _id into another field. We allow sorting on ids, though, which caused errors in Elasticsearch 8 when sorting on _id.

@jtibshirani
Contributor Author

@davishmcclurg this is helpful feedback. For better context, could you describe some use cases where users want to sort on the _id field?

@davishmcclurg

For better context, could you describe some use cases where users want to sort on the _id field?

Internally, we sort on ID (using search_after) to scan through documents. Our users might also find it useful to have a stable sort field to page through the documents they've indexed in app/workplace search. I don't know that we purposefully exposed the ability to sort by ID, but we still need it for backwards compatibility.
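A minimal sketch of that scanning pattern, assuming a unique, sortable key per document. Here `fake_search` is a hypothetical in-memory stand-in for a search request sorted ascending on that key, and `doc_id` is a hypothetical field name; a real search_after request takes an array of sort values rather than a single string.

```python
def fake_search(index, size, search_after=None):
    """Hypothetical stand-in for a search sorted ascending on a unique key.

    Returns at most `size` documents whose key is strictly greater than
    `search_after`, which mirrors the search_after contract.
    """
    ordered = sorted(index, key=lambda d: d["doc_id"])
    if search_after is not None:
        ordered = [d for d in ordered if d["doc_id"] > search_after]
    return ordered[:size]


def scan(index, page_size):
    """Yield every document once by paging with the last seen sort key."""
    last = None
    while True:
        page = fake_search(index, page_size, last)
        if not page:
            return
        yield from page
        # The key of the last hit becomes the cursor for the next page.
        last = page[-1]["doc_id"]


index = [{"doc_id": f"id-{i:03d}"} for i in (5, 1, 4, 2, 3)]
scanned = [d["doc_id"] for d in scan(index, page_size=2)]
print(scanned)
```

The scan is only stable because the sort key is unique per document, which is why _id (or a copy of it) is a natural choice.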

@etki
Contributor

etki commented Dec 26, 2020

Just adding a necrocomment: from what I remember, this has already been discussed, and the reasoning behind not adding doc_values was that the cardinality is as high as the number of documents, while doc_values expect something more selective.

This may be currently a complete lie since that discussion was from years ago and my memory isn't 100% reliable.

@nik9000
Member

nik9000 commented May 4, 2022

I talked to @jtibshirani and she's OK with closing this as something that we're not going to have time to do soon. So I'll do that. But! TSDB (#74660) is not going to store _id at all at some point. It'll reconstruct the _id on the fly from the much lower cardinality _tsid and @timestamp fields. They still have fairly high cardinality, but much, much lower than _id. We don't have plans to do anything like this in the short term for non-TSDB indices.

@mitar
Contributor

mitar commented Oct 4, 2022

I am sad to see this being closed. My use cases (in the same system) are exactly the ones described above:

  • I care only about the _id of each search result and no other field. Our system returns that to the client, and the client then fetches documents through other means (one of the reasons for this is that indexing for Elasticsearch generally requires some transformations to the document to map fields better, while our original document structure is easier to work with elsewhere). I currently use _source: false in search queries to suppress sending _source over the wire, but if there is a way to make ES not even load _source internally (just to get _id), that would be even better.
  • I am also interested in using _id for sorting, but only to break ties when the score is equal between multiple documents. I want to sort such results predictably but randomly, so I am using the random_score scoring function on the _id field with a random seed tied to the user (so that the user gets the same ordering even if they redo the query). This is being deprecated, so having an alternative would be great (without having to copy the _id just to support this edge case).
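The tie-breaking idea in the second point can be sketched on the client side. This is not Elasticsearch's random_score implementation, and it only reorders hits that have already been fetched; it just shows how hashing a per-user seed together with the _id gives an ordering that looks random but is stable for that user. The functions `tiebreak` and `order_hits` and the seed format are hypothetical.

```python
import hashlib


def tiebreak(user_seed, doc_id):
    """Map (seed, _id) to a stable pseudo-random integer."""
    digest = hashlib.sha256(f"{user_seed}:{doc_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def order_hits(hits, user_seed):
    """Sort by score descending; break ties with the seeded hash of _id."""
    return sorted(
        hits,
        key=lambda h: (-h["_score"], tiebreak(user_seed, h["_id"])),
    )


hits = [
    {"_id": "a", "_score": 1.0},
    {"_id": "b", "_score": 1.0},
    {"_id": "c", "_score": 2.0},
    {"_id": "d", "_score": 1.0},
]

# The same user sees the same order, regardless of the incoming hit order.
first = [h["_id"] for h in order_hits(hits, "user-42")]
again = [h["_id"] for h in order_hits(list(reversed(hits)), "user-42")]
print(first, again)
```

Because the hash depends only on the seed and the _id, repeating the query yields the same order for the same user, while a different seed will generally produce a different order among tied documents.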

@mitar
Contributor

mitar commented Oct 4, 2022

One option could be that one could optionally enable doc_values on _id if you wanted to do so and sort on it.


8 participants