Give _uid doc values #11887

jpountz · 2015-06-26T11:18:31Z

We already use fielddata on the _uid field today in order to implement random sorting. However, given that doc values are disabled on _uid, this will use an insane amount of memory in order to load information in memory given that this field only has unique values.

Having better fielddata for _uid would also be useful in order to have more consistent sort order when paginating or hitting different replicas: we could always add a tie-break on the value of the _uid field.

I think we have several options:

Option 1: Add SORTED doc values to _uid
Option 2: Add BINARY doc values to _uid
Option 3: Add SORTED doc values to _type and _id
Option 4: Add SORTED doc values to _type and BINARY to _id

Option 2 would probably be wasteful in terms of disk space given that we don't have good compression available for binary doc values (and it's hard to implement given that the values can store pretty much anything).

Options 3 and 4 have the benefit of not having to duplicate information if we also want to have doc values on _type and _id: we could even build a BINARY fielddata view for _uid.

Then the other question is whether we should rather use sorted or binary doc values, the former being better for sorting (useful for the consistent sorting use-case) and the latter being better for value lookups (useful for random sorting).

The text was updated successfully, but these errors were encountered:

pickypg · 2015-12-01T16:55:56Z

My vote would be for number three or four:

Option 3: Add SORTED doc values to _type and _id
Option 4: Add SORTED doc values to _type and BINARY to _id

With #14783 we already enable doc values for _type, so it makes sense to individually call out the _id as well. This also allows changes to happen to _type without necessarily breaking _id.

In my experience, most users do not use random sorting, but sorting on _id is not very common either. With only sorting in mind, I would expect to see _id used for non-random sorting a lot more than for random sorting. However, for other use cases, such as referencing the _id in other scenarios that use fielddata (e.g., rarely, but sometimes in aggregations, as well as scripts), it may tip it in favor of being binary.

rmuir · 2015-12-01T17:12:00Z

There is a huge difference between _type and _id when it comes to the expense of docvalues.

_type is a lowish cardinality field. This means if you have only 2 unique values foo and bar, lucene will deduplicate this and write 1 bit per document. If you only have 1 unique type (also common), we will write 0 bits per document, each segment just has in the metadata "all docs have value foo: it costs nothing). So for 10M documents with 2 types, docvalues for this field costs a little over a megabyte.

On the other hand unique ids are high cardinality by definition: deduplication does nothing. Either choice is extremely costly in comparison. Lets consider 10M documents with ids of length in bytes 16 each and make some guesses:

BINARY might you ~ 160MB for the bytes (10M * 16). That is how binary works: its just a straightforward encoding of what you gave it. If the IDs do not have a fixed length, but are instead variable length, then there are additional costs.
SORTED might cost you ~160MB for the bytes (10M * 16) and additional 30MB (10M * 24bpv) for ordinals. The bytes are prefix compressed in this case, because access by ordinal is more important, but for randomish ids this compression will probably not be very efficient. Access to the bytes is also slower, that is the downside of prefix compression (which likely does not help).

I just want to make it clear this is apples and oranges. The fact we turned on docvalues for type is irrelevant when it comes to unique ids. We need very strong use cases and features IMO if we are going to incur this cost.

pickypg · 2015-12-01T17:18:05Z

Very good info @rmuir, as usual. It makes me think that _id supporting doc values should exist (particularly in light of #15155), but it should be opt-in.

mikemccand · 2015-12-01T17:20:21Z

I think it's actually 20 bytes for ES's auto-generated IDs (15 fully binary bytes for the Flake ID, and 20 bytes once it's Base64 encoded) ... but, yeah, this would be a big cost ...

rmuir · 2015-12-01T17:23:51Z

Why do we base64? This probably bloats the terms dict today.

rjernst · 2015-12-01T17:30:59Z

It makes me think that _id supporting doc values should exist

Why can't a user store this in their own field if they want to do something crazy with it? I don't think we should add back configurability for metadata fields, even if it is just one. It was a lot of work to remove that (#8143), and these are our fields, for internal use by elasticsearch. Edge cases like described in #15155 can be handled by a user field with doc values enabled, if they want to do such a crazy thing.

pickypg · 2015-12-01T17:34:55Z

But edge cases like #15155 cannot be handled without some other special handling because it's the access of the _id that is the slowdown. Adding a doc value field does not bypass that cost.

eeeebbbbrrrr · 2015-12-21T23:31:28Z

Hi all! @pickypg linked this issue to me because he knows it's near and dear to my heart.

My exact use case (shameless plug: @zombodb: https://github.com/zombodb/zombodb) is actually what y'all are describing as an "edge case" in #15155 -- that is, ES is being used as a searching index only (ie, store=false, _source=disabled), and an external "source of truth" (Postgres) is used to provide document data back to the user.

While @zombodb might be unique in implementation, I doubt its general approach of providing _id values and using them to later lookup records in an external source is.

An implementation detail is that @zombodb, through a REST endpoint plugin, uses the SCAN+SCROLL API to retrieve all matching _id values, re-encodes them as 6byte pairs, and streams them back as a binary blob.

Against ES v1.7 (and 1.6 and 1.5), benchmarking has shown that the overhead of simply retrieving the _id value completely swamps searching and even the String-->byte encoding ZDB does, so I'm excited y'all are looking at ways to make this better.

(as an aside, I've actually spent quite a bit of time debugging this (against 1.5), and found that if a parent<-->child mapping exists, using its cache to lookup the _id by ordinal (bypassing Lucene, decompressing, and decoding the _id) is nearly an order of magnitude faster. I gave some patches to @pickypg awhile back through my employer's support agreement, but we all kinda decided it wasn't worth the effort of integrating into ES because v2.0 was near and changed everything.)

The idea that such things can "be handled by a user field with doc values enabled" isn't really true, as @pickypg pointed out, because ES is still doing all the work to retrieve the _id value for each hit.

So a half-baked idea would be: What if retrieving the _id could be disabled on a search-by-search basis? Instead, the search request would specify a "user field with doc values enabled" that is a copy of the _id value. Maybe more generally, the ability to elide returning all the fields that are deemed "for internal use by elasticsearch"?

eeeebbbbrrrr · 2015-12-22T22:35:17Z

So I experimented with this idea (disabling returning _id and _type) against v1.7 (I'm not in a position to work with v2.x yet).

All I did was quickly hack FetchPhase.java to set the fieldsVisitor to null and then guard against that in the places it's used, and hardcoded both the "type" and "id" properties of the SearchHit to the empty string.

I then setup a little benchmark using @zombodb.

With a query that returns 14k documents, retrieving all the "ids" in a SCAN+SCROLL loop:

Stock ES: 17 per second
Hacked Version: 120 per second

Of course, all the ids were blank, so it's not very useful!

I then added a doc_values=true field to the index that contains a copy of the _id field. Against the hacked version, I was able to sustain 104 per second. That's about a 6x gain. There's definitely quite a bit of overhead in uid decoding.

In case you care how I hacked FetchPhase.java: https://gist.github.com/eeeebbbbrrrr/9af88e6dc88943450c73

rmuir · 2015-12-23T12:58:06Z

With a query that returns 14k documents

You should just return the top-N instead. That is what lucene is designed to do.

eeeebbbbrrrr · 2015-12-23T17:20:04Z

You should just return the top-N instead. That is what lucene is designed to do.

The point is that there's room for significant improvement around how _uid is handled. I was trying to show what the overhead is -- and on my test data on my laptop, it's about 6x. If a reasonable way to improve this can be found, everyone wins.

rmuir · 2015-12-23T17:51:31Z

Well, lucene just isn't designed to return 14k documents, and by the way docvalues aren't designed for that either. for such huge numbers then a database is a better solution, as it is designed for those use cases.

Just like you wouldn't move your house with a sports car: its a faster vehicle, but its gonna be slower overall.

eeeebbbbrrrr · 2015-12-23T18:20:08Z

Just like you wouldn't move your house with a sports car: its a faster vehicle, but its gonna be slower overall.

I don't know how this is relevant.

If y'all make progress towards improving _uid in whatever way, I'd be happy to help test and benchmark changes.

shamak · 2016-02-27T00:38:21Z

Hey, I stumbled upon this issue while I was trying to do something similar in Elasticsearch. I aimed (ambitiously) to retrieve ~1million documents in under 1 second based on a simple filter query. I noticed the unzipping of the '_id' field was taking a while (~8seconds) using the hot_threads API:

org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsRe
der.java:249) org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:427)
org.elasticsearch.search.fetch.FetchPhase.createSearchHit(FetchPhase.java:219)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:184)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:401)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:833)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:824)

So I wrote a plugin to stop retrieving the '_id' field, and just retrieve a secondary, integer, doc_values field from the document, specified in the query. I thought this would be super quick but suprisingly, it took almost the same amount of time and now, the hot_threads API showed:

 org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
       org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
       org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
       org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
       org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
       org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:249)
       org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
       org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:70)
       org.elasticsearch.search.lookup.SourceLookup.extractRawValues(SourceLookup.java:145)
       plugin.retrievedocvalues.search.fetch.CustomFetchPhase.createSearchHit(CustomFetchPhase.java:256)
       plugin.retrievedocvalues.search.fetch.CustomFetchPhase.execute(CustomFetchPhase.java:189)
       plugin.retrievedocvalues.search.CustomSearchService.executeFetchPhase(CustomSearchService.java:500)

The query I'm using is against a custom endpoint and the body is:

 { "sort": "_doc",
    "_source": false,
   "fields": ["foo"]
    "size": 1000000,
    "filter": {
        "bool": {
            "should": [
                {
                    "term": {
                        "foo": "bar"
                    }
                },
                {
                    "term": {
                        "baz": "qux"
                    }
                }
            ]
        }
    }

The field 'foo' is an integer field which has doc_values enabled on ES version 1.7.1. The weird thing is the aggregation on the field is super quick, but retrieving the data itself is slow.

I guess the underlying point is it may not be that much faster to enable doc_values on the '_id' field since I can't see much of an improvement, unless I'm missing something which someone here could point out?

bleskes · 2016-02-29T13:05:25Z

@shamak you can use fielddata_fields in your search request to retrieve field values from doc values (or in memory field data). Fields are meant to get stored fields with a fall back to _source (which was removed in 5.x as it is confusing):

GET _search
{
  "fielddata_fields": [ "fieldname"]
}

Note though that getting 10K docs should be done with a scroll rather than getting so many docs at one.

Since we now promote doc values as a possible data storage (next to _source and stored fields) , I wonder if we should support a doc_value_fields entry in the search response. I think more and more people will expect it to be there. /cc @clintongormley @jpountz

clintongormley · 2016-02-29T14:36:29Z

Since we now promote doc values as a possible data storage (next to _source and stored fields) , I wonder if we should support a doc_value_fields entry in the search response. I think more and more people will expect it to be there. /cc @clintongormley @jpountz

That's essentially what fielddata_fields is. We were talking about not using the doc values terminology in favour of in-memory vs on-disk fielddata, although I don't think that's the right tradeoff either. The "fielddata" term has history, and referring to doc values as "on-disk" does them a disservice given that they're usually cached in RAM.

So yes, maybe we should add doc_values_fields (or just doc_values?) as a synonym for fielddata_fields?

jpountz · 2016-03-18T10:47:48Z

Something else we could consider would be to only store the id and type in doc values and not in stored fields in order to not incur a large increase of index size. The benefit is that we would not need any new option on the mappings. However the fetch phase would have to 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.

nik9000 · 2016-03-18T14:52:47Z

However the fetch phase would have to 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.

I suppose then disabling _source would entirely skip stored fields which is kind of cool.

I suspect the _type is going to be cached super fast, especially if we ever decide to sort by _type. Many many use cases use a single type per index so the type lookup is just metadata. Either way I suspect you'd see closer to 2 seeks than 3. Even still, 2 is much worse than 1.

Another question: do we really need to return the _id and _type all the time? I know I typically just wanted some portion of the _source. Usually, like, two or three fields from _source and a couple of highlights. Anyway, maybe we should allow those to be disabled.

pickypg · 2016-03-18T15:26:10Z

I like the idea of not always returning those fields as it's unnecessary information in a lot of cases, especially for the single _type use case. We call it metadata, so maybe we should treat it like metadata and only return it when requested (defaulting to true).

jpountz · 2016-03-18T20:54:32Z

I am fine with allowing some of those meta fields to not be returned, but I tend to like that they are returned by default: it is easy to forget that some things are not available if they are not returned by default, and it makes reindexing easier as you don't have to think about fields that you might need for reindexing: everything is there by default.

jimczi · 2016-06-01T08:58:20Z

I made some tests to check the cost of adding the docvalues to the _id field. I tried to index 1M documents with one field (_id) and different configurations.
I tested 3 configurations:

_id with index=true and stored=true
_id with index=true, stored=false and binary doc values.
_id with index=true, stored=false and sorted doc values.
For the generation of the _id I tried all the configurations with UUIDs.base64UUID and UUIDs.randomBase64UUID.

base64UUID

Configuration	Size	Docs/s	Random access (docs/s)	Sequential access(docs/s)
Stored	12 MB	372,000	532,000	716,000
BinaryDV	26 MB	378,000	9,009,000	40,000,000
SortedDV	13 MB	255,000	4,608,000	16,129,000

The binary doc values doubles the size of the index because they don't use any compression. They are very fast for accessing any values and the indexation speed is almost the same as the stored field.
The sorted doc values have almost the same size than the stored field, this is due to the prefix compression that they use to store the values. They are also quite fast to access any values but the indexation is slower ( ˜= 30% slower).

randomBase64UUID

Configuration	Size	Docs/s	Random access (docs/s)	Sequential access(docs/s)
Stored	49 MB	332,000	719,000	1,751,000
BinaryDV	46 MB	358,000	8,695,000	38,461,000
SortedDV	48MB	246,000	5,524,000	9,523,000

For the random id case, the size of the index is almost the same for the 3 configurations but the sorted doc values are still slower to index the data.

I ran some benchmark and the extra cost during the indexation for the sorted doc values is the sorting of the dictionary (in this case we need to do it twice, one for the terms dictionary of the postings and one for the sorted doc values).
Since each _id is unique I tried to add a way to search on the sorted doc values directly, to do so I just added a file that contains the docID for each _id. It's an extra cost of 4 bytes per documents (it's faster for random access to use a full int instead of a vint or a block compression) and to search a docID it needs to retrieve the ordinal of the term first and then seek/read the docID. For existing _id the search is faster than the one that uses the postings but it can be slower when the _id does not exist and does not share a prefix with the existing ones (the latter case is optimized in the terms dictionary of the postings).
I don't know if this can be something we want to explore but I wanted to propose at least one option if the extra cost of adding doc values to the _id field is prohibitive.

jpountz · 2016-06-01T22:55:36Z

Thanks for testing! The hybrid postings/doc-values idea sounds appealing, but it might be challenging to expose it cleanly? (I haven't thought much about it). Otherwise I am wondering how much LUCENE-7299 would close the gap in terms of indexing speed with SORTED_SET doc values and also that maybe we should implement tome simple compression on binary doc values for such cases (eg. based on the most common ngrams).

rmuir · 2016-06-02T13:27:19Z

I don't think the idea of trying to use the postings dictionary for the term dictionary will work well (besides practical concerns). It will simply be too slow.

The problem is, they are different data structures (it is like trie versus tree, but the difference is important).

The terms dictionary is optimized for lookup by "String", but the docvalues dictionary is optimized for lookup by ordinal.

The docvalues lookup by term is much slower than the postings one, because its not optimized for that. The inverse is true for lookup by ordinal: the entire datastructure is built around doing this with as little overhead as possible: it can do random access within a block, etc.

Given that even a vint for prefix/suffix length is too costly for that case, I don't think we should introduce a branch per-byte with something like n-gram compression. I have run the numbers for that on several datasets (real data: not artificial crap like IDs) and it only saves something like 25% space for that datastructure, depending on the text: in many cases lower than that.

Its important to keep seek-by-ord fast at the moment, because too much code uses sorted/sorted_set docvalues in an abusive fashion with a seek-by-ord for every document, to lookup the text. Elasticsearch has gotten a little better by incorporating things like global ordinals, but it still has bad guys like its scripting support. There are similar cases for other lucene users and even in some lucene modules itself. Historically, people wrote code expecting this to be "ok" and "fast" with fieldcache/fielddata, because that did no compression at all: not even prefix compression. A lot of this code was just ported to docvalues without addressing this, so we still have to keep it fast.

jpountz · 2017-08-01T08:54:54Z

We don't want to add an option to metadata fields, and we don't want to make everyone pay the price for doc values on _id so we will have to do without doc values on _id.

jpountz added the discuss label Jun 26, 2015

clintongormley added v2.0.0-beta1 >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Jun 26, 2015

clintongormley added v2.0.0 and removed v2.0.0-beta1 labels Aug 13, 2015

clintongormley added v2.1.0 and removed v2.1.0 v2.0.0 labels Oct 6, 2015

clintongormley added v2.2.0 and removed v2.1.0 labels Nov 20, 2015

pickypg mentioned this issue Dec 1, 2015

Use doc_values for streaming _uid / _id #15155

Closed

spinscale added v2.3.0 and removed v2.2.0 labels Dec 23, 2015

jpountz mentioned this issue Jan 20, 2016

Add search_after parameter in the SearchAPI #16125

Merged

jimczi mentioned this issue Feb 12, 2016

Evaluate the replacement of the scroll API with search_after #16631

Closed

clintongormley added v2.4.0 and removed v2.3.0 labels Mar 16, 2016

clintongormley mentioned this issue May 5, 2016

_uid should be indexed in Lucene in binary form, not base64 #18154

Closed

jimczi mentioned this issue May 24, 2016

Add the ability to partition a scroll in multiple slices. #18237

Merged

rendel mentioned this issue May 31, 2016

support joining on a metafield sirensolutions/siren-join#63

Open

jimczi mentioned this issue Aug 10, 2016

Add the ability to disable the retrieval of the metadata fields #19918

Closed

clintongormley added v2.4.1 and removed v2.4.0 labels Aug 24, 2016

clintongormley mentioned this issue Aug 24, 2016

random_sort on query with has_child eating insane amounts of memory in field data #20141

Closed

clintongormley added v2.4.2 and removed v2.4.1 labels Sep 23, 2016

clintongormley removed the v2.4.2 label Nov 6, 2016

jpountz mentioned this issue May 9, 2017

Identify documents by their _id. #24460

Merged

weltenwort mentioned this issue May 19, 2017

[context view] using the _uid field as a tiebreaker consumes a lot of fielddata memory elastic/kibana#11925

Closed

jpountz mentioned this issue Jun 15, 2017

Deprecate fielddata on _id/_uid #25240

Closed

jpountz closed this as completed Aug 1, 2017

imotov mentioned this issue Aug 6, 2018

Sort on _id causes OOM #32626

Closed

jtibshirani mentioned this issue Aug 5, 2020

Consider storing _id through doc values. #60778

Closed

javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Give _uid doc values #11887

Give _uid doc values #11887

jpountz commented Jun 26, 2015

pickypg commented Dec 1, 2015

rmuir commented Dec 1, 2015

pickypg commented Dec 1, 2015

mikemccand commented Dec 1, 2015

rmuir commented Dec 1, 2015

rjernst commented Dec 1, 2015

pickypg commented Dec 1, 2015

eeeebbbbrrrr commented Dec 21, 2015

eeeebbbbrrrr commented Dec 22, 2015

rmuir commented Dec 23, 2015

eeeebbbbrrrr commented Dec 23, 2015

rmuir commented Dec 23, 2015

eeeebbbbrrrr commented Dec 23, 2015

shamak commented Feb 27, 2016

bleskes commented Feb 29, 2016

clintongormley commented Feb 29, 2016

jpountz commented Mar 18, 2016

nik9000 commented Mar 18, 2016

pickypg commented Mar 18, 2016

jpountz commented Mar 18, 2016

jimczi commented Jun 1, 2016

jpountz commented Jun 1, 2016

rmuir commented Jun 2, 2016

jpountz commented Aug 1, 2017

Give _uid doc values #11887

Give _uid doc values #11887

Comments

jpountz commented Jun 26, 2015

pickypg commented Dec 1, 2015

rmuir commented Dec 1, 2015

pickypg commented Dec 1, 2015

mikemccand commented Dec 1, 2015

rmuir commented Dec 1, 2015

rjernst commented Dec 1, 2015

pickypg commented Dec 1, 2015

eeeebbbbrrrr commented Dec 21, 2015

eeeebbbbrrrr commented Dec 22, 2015

rmuir commented Dec 23, 2015

eeeebbbbrrrr commented Dec 23, 2015

rmuir commented Dec 23, 2015

eeeebbbbrrrr commented Dec 23, 2015

shamak commented Feb 27, 2016

bleskes commented Feb 29, 2016

clintongormley commented Feb 29, 2016

jpountz commented Mar 18, 2016

nik9000 commented Mar 18, 2016

pickypg commented Mar 18, 2016

jpountz commented Mar 18, 2016

jimczi commented Jun 1, 2016

base64UUID

randomBase64UUID

jpountz commented Jun 1, 2016

rmuir commented Jun 2, 2016

jpountz commented Aug 1, 2017