Give _uid doc values #11887
My vote would be for number three or four: with #14783 we already enable doc values for `_type`. In my experience, most users do not use random sorting, but sorting on `_uid`.
There is a huge difference between a low-cardinality field like `_type`, where doc values deduplicate down to a handful of distinct values, and unique ids. Unique ids are high cardinality by definition: deduplication does nothing, so either doc-values choice is extremely costly in comparison. Let's consider 10M documents with ids of 16 bytes each and make some guesses (see the sketch below):
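The commenter's actual numbers were not captured above; here is a back-of-the-envelope sketch of my own, under the stated assumptions (10M documents, 16-byte ids, no deduplication possible):

```python
# Rough cost of giving every document an uncompressed id in doc values.
# These are illustrative numbers, not the original commenter's.
docs = 10_000_000
id_len = 16                          # assumed bytes per unique id
binary_bytes = docs * id_len         # BINARY doc values keep each value verbatim
print(f"{binary_bytes / 2**20:.0f} MiB")   # ~153 MiB of raw id bytes alone
```

Unlike `_type`, where a handful of distinct values deduplicate to almost nothing, every one of those bytes is unique payload.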
I just want to make it clear this is apples and oranges. The fact we turned on doc values for `_type` is irrelevant when it comes to unique ids. We need very strong use cases and features, IMO, if we are going to incur this cost.
I think it's actually 20 bytes for ES's auto-generated IDs (15 fully binary bytes for the Flake ID, which becomes 20 bytes once it's Base64 encoded) ... but, yeah, this would be a big cost ...
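A quick way to check the 15-bytes-to-20-characters arithmetic (a standalone sketch, not ES code):

```python
import base64, os

flake = os.urandom(15)                        # stand-in for a 15-byte Flake-style id
encoded = base64.urlsafe_b64encode(flake)     # URL-safe Base64; 15 bytes encode exactly
print(len(encoded))                           # 20: ceil(15 / 3) * 4 characters, no padding
```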
Why do we base64? This probably bloats the terms dict today.
Why can't a user store this in their own field if they want to do something crazy with it? I don't think we should add back configurability for metadata fields, even if it is just one. It was a lot of work to remove that (#8143), and these are our fields, for internal use by elasticsearch. Edge cases like the one described in #15155 can be handled by a user field with doc values enabled, if they want to do such a crazy thing.
But edge cases like #15155 cannot be handled without some other special handling, because it's the access of the `_uid` stored field itself that is the expensive part.
Hi all! @pickypg linked this issue to me because he knows it's near and dear to my heart. My exact use case (shameless plug: @zombodb: https://github.com/zombodb/zombodb) is actually what y'all are describing as an "edge case" in #15155 -- that is, ES is being used as a search index only (i.e., store=false, _source=disabled), and an external "source of truth" (Postgres) is used to provide document data back to the user. While @zombodb might be unique in implementation, I doubt its general approach of handing `_id` values back to an external store is.

An implementation detail is that @zombodb, through a REST endpoint plugin, uses the SCAN+SCROLL API to retrieve all matching `_id` values.

Against ES v1.7 (and 1.6 and 1.5), benchmarking has shown that the overhead of simply retrieving the `_id` of every matching document dominates the total query time.

(As an aside, I've actually spent quite a bit of time debugging this (against 1.5), and found that if a parent<-->child mapping exists, using its cache to look up the `_id` values is significantly faster than reading the stored `_uid` field.)

The idea that such things can "be handled by a user field with doc values enabled" isn't really true, as @pickypg pointed out, because ES is still doing all the work to retrieve the `_uid` stored field regardless.

So a half-baked idea would be: what if retrieving the `_id` (and `_type`) could be turned off per request?
So I experimented with this idea (disabling returning _id and _type) against v1.7 (I'm not in a position to work with v2.x yet). All I did was quickly hack FetchPhase.java.

I then set up a little benchmark using @zombodb. With a query that returns 14k documents, retrieving all the "ids" in a SCAN+SCROLL loop (sketched below): stock ES managed 17 per second. Of course, with my hack all the ids were blank, so it's not very useful! I then added a doc_values-enabled user field to stand in for the id.

In case you care how I hacked FetchPhase.java: https://gist.github.com/eeeebbbbrrrr/9af88e6dc88943450c73
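For context, the benchmark loop is the classic 1.x SCAN+SCROLL pattern. A minimal sketch over the REST API, where the host, index name, and batch size are all made up:

```python
import requests

BASE = "http://localhost:9200"        # hypothetical local node

# 1.x SCAN searches return no hits up front, just a scroll_id to page with.
resp = requests.post(f"{BASE}/myindex/_search",
                     params={"search_type": "scan", "scroll": "1m"},
                     json={"query": {"match_all": {}}, "size": 1000}).json()
scroll_id = resp["_scroll_id"]

ids = []
while True:
    page = requests.post(f"{BASE}/_search/scroll",
                         params={"scroll": "1m"},
                         data=scroll_id).json()   # 1.x took the raw scroll_id as the body
    hits = page["hits"]["hits"]
    if not hits:
        break
    ids.extend(hit["_id"] for hit in hits)        # the per-hit _id read this thread wants cheaper
    scroll_id = page["_scroll_id"]
```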
You should just return the top-N instead. That is what lucene is designed to do.
The point is that there's room for significant improvement around how the `_id` values are retrieved.
Well, lucene just isn't designed to return 14k documents, and by the way docvalues aren't designed for that either. For such huge numbers a database is a better solution, as it is designed for those use cases. Just like you wouldn't move your house with a sports car: it's a faster vehicle, but it's going to be slower overall.
I don't know how this is relevant. If y'all make progress towards improving `_id` retrieval, that would be great.
Hey, I stumbled upon this issue while I was trying to do something similar in Elasticsearch. I aimed (ambitiously) to retrieve ~1 million documents in under 1 second based on a simple filter query. Using the hot_threads API, I noticed the unzipping of the '_id' field was taking a while (~8 seconds).

So I wrote a plugin to stop retrieving the '_id' field, and instead retrieve a secondary integer doc_values field from the document, specified in the query. I thought this would be super quick but, surprisingly, it took almost the same amount of time according to hot_threads.

The query I'm using is against a custom endpoint; a sketch of the body shape follows below. The field 'foo' is an integer field which has doc_values enabled, on ES version 1.7.1. The weird thing is that an aggregation on the field is super quick, but retrieving the data itself is slow. I guess the underlying point is that it may not be that much faster to enable doc_values on the '_id' field, since I can't see much of an improvement, unless I'm missing something which someone here could point out?
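The exact request body was not captured in this transcript; a plausible 1.7-era shape, with the filter invented for illustration (only the doc_values field 'foo' comes from the comment), would be:

```python
body = {
    "query": {                       # hypothetical filter; 1.x "filtered" query syntax
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"term": {"status": "active"}},
        }
    },
    "fielddata_fields": ["foo"],     # read the secondary integer field from doc values
    "size": 10000,
}
```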
@shamak you can use `fielddata_fields` for that (a sketch follows below).

Note though that getting 10K docs should be done with a scroll rather than getting so many docs at once. Since we now promote doc values as a possible data storage (next to _source and stored fields), I wonder if we should support a `docvalue_fields` option.
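My reading is that the suggestion above refers to the existing 1.x `fielddata_fields` fetch option (an assumption on my part); the proposed `docvalue_fields` spelling would look almost identical in a search body:

```python
# What exists in 1.x: values come from fielddata/doc values, not stored fields.
with_fielddata = {
    "query": {"match_all": {}},
    "fielddata_fields": ["foo"],
}

# The proposed clearer spelling, reflecting that doc values back the lookup.
with_docvalues = {
    "query": {"match_all": {}},
    "docvalue_fields": ["foo"],
    "_source": False,               # skip _source entirely when it is not needed
}
```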
That's essentially what `fielddata_fields` already does. So yes, maybe we should add `docvalue_fields` as a better name for it.
Something else we could consider would be to only store the id and type in doc values, and not in stored fields, in order to not incur a large increase of index size. The benefit is that we would not need any new option on the mappings. However, the fetch phase would have to do 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.
I suppose then disabling _source would entirely skip stored fields, which is kind of cool. I suspect the _type lookup is going to be cached super fast, especially if we ever decide to sort by _type. Many, many use cases use a single type per index, so the type lookup is just metadata. Either way, I suspect you'd see closer to 2 seeks than 3. Even still, 2 is much worse than 1. Another question: do we really need to return the _id and _type all the time? I know I typically just want some portion of the _source: usually two or three fields from _source and a couple of highlights (see the sketch below). Anyway, maybe we should allow those to be disabled.
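To illustrate the "two or three fields plus highlights" pattern (the field names here are hypothetical):

```python
body = {
    "query": {"match": {"body": "doc values"}},      # hypothetical query field
    "_source": ["title", "author"],                  # only a slice of _source comes back
    "highlight": {"fields": {"body": {}}},           # plus a couple of highlight fragments
}
```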
I like the idea of not always returning those fields, as it's unnecessary information in a lot of cases, especially for the single-type-per-index case.
I am fine with allowing some of those meta fields to not be returned, but I tend to like that they are returned by default: it is easy to forget that some things are not available if they are not returned by default, and it makes reindexing easier, as you don't have to think about fields that you might need for reindexing: everything is there by default.
I made some tests to check the cost of adding doc values to the _id field. I tried to index 1M documents with one field (_id) under different configurations (stored field only, binary doc values, sorted doc values).

base64UUID: the binary doc values double the size of the index because they don't use any compression. They are very fast for accessing any value, and the indexing speed is almost the same as with the stored field alone.

randomBase64UUID: for the random id case, the size of the index is almost the same for the 3 configurations, but the sorted doc values are still slower to index. I ran some benchmarks, and the extra cost during indexing for the sorted doc values is the sorting of the dictionary (in this case we need to do it twice: once for the terms dictionary of the postings and once for the sorted doc values).
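For reference, rough stand-ins for the two id flavours being compared (not ES's exact generators): time-prefixed ids share long common prefixes, which prefix-compressing structures like the postings terms dictionary exploit, while fully random ids defeat that everywhere.

```python
import base64, itertools, os, time

def random_base64_uuid():
    # fully random: neighbouring ids share no prefix, so dictionaries compress poorly
    return base64.urlsafe_b64encode(os.urandom(15)).decode()

_seq = itertools.count()
def time_based_base64_uuid():
    # timestamp-first layout: ids created close in time share a long common prefix
    raw = (int(time.time() * 1000).to_bytes(6, "big")
           + (next(_seq) & 0xFFFFFF).to_bytes(3, "big")
           + os.urandom(6))                      # 6 + 3 + 6 = 15 bytes, 20 chars encoded
    return base64.urlsafe_b64encode(raw).decode()
```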
Thanks for testing! The hybrid postings/doc-values idea sounds appealing, but it might be challenging to expose it cleanly (I haven't thought much about it). Otherwise, I am wondering how much LUCENE-7299 would close the gap in terms of indexing speed with SORTED_SET doc values, and also whether we should implement some simple compression on binary doc values for such cases (e.g. based on the most common ngrams).
I don't think the idea of trying to use the postings dictionary as the doc-values term dictionary will work well (besides practical concerns). It will simply be too slow. The problem is, they are different data structures (it is like trie versus tree, but the difference is important). The terms dictionary is optimized for lookup by string, but the docvalues dictionary is optimized for lookup by ordinal. The docvalues lookup by term is much slower than the postings one, because it's not optimized for that. The inverse is true for lookup by ordinal: the entire data structure is built around doing this with as little overhead as possible: it can do random access within a block, etc.

Given that even a vint for prefix/suffix length is too costly for that case, I don't think we should introduce a branch per byte with something like n-gram compression. I have run the numbers for that on several datasets (real data: not artificial crap like IDs) and it only saves something like 25% space for that data structure, depending on the text: in many cases lower than that.

It's important to keep seek-by-ord fast at the moment, because too much code uses sorted/sorted_set docvalues in an abusive fashion, with a seek-by-ord for every document to look up the text. Elasticsearch has gotten a little better by incorporating things like global ordinals, but it still has bad guys like its scripting support. There are similar cases for other lucene users, and even in some lucene modules. Historically, people wrote code expecting this to be "ok" and "fast" with fieldcache/fielddata, because that did no compression at all: not even prefix compression. A lot of this code was just ported to docvalues without addressing this, so we still have to keep it fast.
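A toy illustration of the asymmetry described above, grossly simplified relative to the real block-compressed structures: a dictionary laid out for ordinal access answers ord->term with a plain array index, while term->ord needs a search, and the postings terms dictionary makes the opposite trade.

```python
import bisect

terms = sorted({"apple", "banana", "cherry", "kiwi"})   # deduplicated, sorted dictionary

def lookup_by_ord(ord_):
    # what SORTED(_SET) doc values optimize: cheap random access by ordinal
    return terms[ord_]

def lookup_by_term(term):
    # what the postings terms dictionary optimizes; a binary search stands in here
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1
```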
We don't want to add an option to metadata fields, and we don't want to make everyone pay the price for doc values on `_uid`.
We already use fielddata on the `_uid` field today in order to implement random sorting. However, given that doc values are disabled on `_uid`, this will use an insane amount of memory to load that information, given that this field only has unique values.

Having better fielddata for `_uid` would also be useful in order to have more consistent sort order when paginating or hitting different replicas: we could always add a tie-break on the value of the `_uid` field.

I think we have several options:

1. remove the ability to sort on `_uid`
2. add BINARY doc values to `_uid`
3. add SORTED doc values to `_type` and `_id`
4. add SORTED doc values to `_type` and BINARY doc values to `_id`

Option 2 would probably be wasteful in terms of disk space, given that we don't have good compression available for binary doc values (and it's hard to implement given that the values can store pretty much anything).

Options 3 and 4 have the benefit of not having to duplicate information if we also want to have doc values on `_type` and `_id`: we could even build a BINARY fielddata view for `_uid`.

Then the other question is whether we should rather use sorted or binary doc values, the former being better for sorting (useful for the consistent-sorting use case) and the latter being better for value lookups (useful for random sorting).
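For concreteness, the random-sorting use case that drags `_uid` fielddata into memory is the seeded `random_score` function; a 1.x-era sketch (the seed value is arbitrary):

```python
body = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            # seeded randomness hashes each document's _uid so that results are
            # reproducible, which is what forces _uid fielddata to be loaded
            "random_score": {"seed": 42},
        }
    }
}
```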