Indexing: index-time sorting #6720

jpountz · 2014-07-03T20:34:16Z

Lucene has index-time sorting and early query termination capabilities. It would be nice to integrate it into Elasticsearch in order to speed up queries whose sort order matches the index order. This presentation contains some information about these features and their implementation in Lucene.

s1monw · 2014-07-04T07:47:43Z

I think this is an awesome feature for specific problems. Yet I think we have to make sure that this never ever corrupts an index during a merge. IMO we should restrict this to numeric fields and allow only asc / desc and that field is automatically marked as using on-disk docvalues to make sure we don't fail the merge due to OOM.

rmuir · 2014-07-04T09:27:12Z

why limit to numeric? whether a field is numeric or not is completely unrelated to this feature: sorry that just doesn't make sense. Since lucene 4.8 you can supply any arbitrary Sort here.

s1monw · 2014-07-04T09:29:03Z

I am super concerned about memory consumption so I should have rather said restricted to on-disk DV. I don't think it's unrelated

rmuir · 2014-07-04T12:51:30Z

my question was with regards to string fields though. I think those are ok too (as well as multiple-level sorts or whatever lucene has). If we want to limit it to DV to prevent traps in merging thats fine, but I think string is just as good as numeric.

Separately, the real issue here is how well early termination will work with situations like aggregations. I don't yet fully understand how this is working, but it looks like it simply plugs into collector. so you would get wrong aggregation counts with early termination if we aren't careful.

jpountz · 2014-07-04T13:23:21Z

+1 on allowing string doc values and multi-level sorts. We don't need to have them from the first iteration but having them eventually would be nice. Then we could early-terminate on any prefix on the sort that is configured at index time.

Even without aggregations, simple things like the total number of hits could only be approximated. I guess there would need to be a per-query flag to turn early termination on.

rjernst · 2014-07-08T17:07:52Z

This sounds like the PageRank use case, but there is another as well.

Sorting can be very useful for compression, to allow "clumps" of documents to be near each other (thus producing smaller deltas in postings). For example, let's say we have songs lyrics as our documents, and we sort by genre. If we restrict the search to a specific genre, we will quickly narrow into a small portion of the postings list, with small deltas. Also, if we do a search across all genres, but for a term which is highly correlated with a genre (e.g. "cowboy" for "Country"), then again we will have good compression and search speed for that term.

My point is just that you don't need early termination for this to be useful. It can help speed up queries if you just understand your data and the common types of queries you run.

ofavre · 2014-08-22T18:43:42Z

Algolia amazing speed is achieved through index-time ordering, and they do support typo-tolerance as well.
I think it's safe to say that this can very well be a major feature.

Vineeth-Mohan · 2015-05-31T17:09:51Z

Looking forward for this feature.

clintongormley · 2015-06-06T18:12:10Z

Comment copied over from #8873:

Documents of different types tend to have different fields, which can lead to problems of sparsity when documents are stored in the order that they were indexed. If we sort documents by type, then missing values (eg in doc values or norms) can be represented very efficiently.

On merge, documents should be sorted by type (assuming there is more than one type). Additionally, for certain use cases, it can make sense to apply a secondary sort (eg on timestamp). This secondary level should be customizable per index, and should probably be dynamically updatable (although it would only have effect on later merges).

Relates to #8870

jpountz · 2015-09-04T11:05:05Z

We probably want to have https://issues.apache.org/jira/browse/LUCENE-6766 first.

otisg · 2015-09-05T16:13:44Z

@jpountz would this help in the typical ELK case where one indexes logs (older first, then newer ones) and typically short in the opposite order - first new ones, then old ones?

In other words, with a typical ELK case where logs are being indexed would the natural indexing order already result in ideally sorted index, or would one have to run something (whatever LUCENE-6766 and this ES issue expose) to inverse the sort order of such an index?

jpountz · 2015-09-05T16:44:49Z

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

In my opinion, this would be more useful for pure search use-cases that only need to fetch the top docs and have a low to moderate ingestion rate.

nik9000 · 2015-09-07T01:28:40Z

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

I suspect it'd be faster to answer "find me the last 10 things that match this query" which is pretty useful. You'd have to not care about the count of total matches and I'm not sure if that outweighs the index time cost. I don't know this feature well so I can speak to what that index time cost amounts to.

jimczi · 2015-12-07T15:11:20Z

After reading the description I though it would be pretty easy to add the feature but then I realized that it would break the blockjoin/nested support. Seems like we would need another sorting merge policy which takes the nested document into account...

JanJakes · 2016-07-19T14:13:44Z

Any plans/updates on this issue? Using index time sorting would have huge performance impacts on queries like "give me 10 items with the highest rating" when having millions of them. Seems like Algolia is using exactly this approach: https://www.algolia.com/doc/faq/index-configuration/how-to-sort-the-results-with-a-specific-attribute

jpountz · 2016-07-19T14:20:57Z

We are slowly making progress: index sorting is going to be better integrated into Lucene in the upcoming 6.2 release https://issues.apache.org/jira/browse/LUCENE-6766 so when 6.2 is out we can start working on the integration. This will take time so don't expect it to be implemented and automatically terminate queries early etc. soon, but there is definitely progress being made.

weiqiyiji · 2017-01-17T15:10:30Z

Hey, guys, is there any progress? Currently I need index-time sorting in my product, and I wonder if I can wait util you guys implement this feature or I have to do it myself first.

mikemccand · 2017-01-17T15:48:35Z

There has been good progress on the Lucene side, including a change @jimczi just pushed to Lucene 6.x (https://issues.apache.org/jira/browse/LUCENE-7579) for its future 6.5.0 release, to give a big boost to indexing performance with index-time sorting. You can see the impact of that change in the nightly sparse Lucene benchmarks: it's annotation V in this chart: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput

But we still have plenty of work to do to expose sorted indices in ES.

weiqiyiji · 2017-01-17T16:59:54Z

@mikemccand Ok, I'll watch this issue first. Thanks!

makeyang · 2017-03-27T13:13:37Z

@mikemccand any branch working on this so we can take a look at it ?

jpountz · 2017-03-27T13:23:42Z

There is no branch yet.

makeyang · 2017-03-31T14:15:12Z

sorry to bother to propose some thoughts on this issue:

add mapping parameters: index_sort with values[no, asc, desc] for datatype[long,integer,short,byte,double,float] and index_sort_order with no-negative int value if and only if fields setting index_sort
these 2 parameters can't be modified once it is set for a index
index_sort_order is used to decide order of multi fields and 1 stands for the first sequence to decide sort
it should not allow null value for fields with mapping parameter index_sort set not no for easy to implement
it should not allow indices with no-zero-doc to add fields with index_sort parameter for easy to implement

jimczi · 2017-03-31T14:21:18Z

@makeyang thanks.
I am planning to add the index sorting spec in the index settings rather than on the index mapping since you can define an index sort based on multiple fields.
Anyway I am currently iterating on this feature and I'll add updates (and link to the PR when I open it) to this issue so stay tuned !

This change adds an index setting to define how the documents should be sorted inside each Segment. It allows any numeric, date, boolean or keyword field inside a mapping to be used to sort the index on disk. It is not allowed to use a `nested` fields inside an index that defines an index sorting since `nested` fields relies on the original sort of the index. This change does not add early termination capabilities in the search layer. This will be added in a follow up. Relates #6720

jimczi · 2017-04-19T12:40:27Z

Index time sorting has landed in master:
#24055

There are still missing pieces that we need to add in order to make index sorting useful:

Documentation regarding when to (not) use index sorting and how
Early termination for queries that rely on the index sort
Faster sorted scroll queries when the query sort matches the index sort.

This is a spin off for elastic#24398. This commit refactors the query phase in order to be able to automatically detect queries that can be early terminated. If the index sort matches the query sort, the top docs collection is early terminated on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector. This change also adds a new parameter to the search request called `track_total_hits`. It indicates if the total number of hits that match the query should be tracked. If false, queries sorted by the index sort will not try to compute this information and will limit the collection to the first N documents per segment. Aggregations are not impacted and will continue to see every document even when the index sort matches the query sort and `track_total_hits` is false. Relates elastic#6720

…4864) This commit refactors the query phase in order to be able to automatically detect queries that can be early terminated. If the index sort matches the query sort, the top docs collection is early terminated on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector. This change also adds a new parameter to the search request called `track_total_hits`. It indicates if the total number of hits that match the query should be tracked. If false, queries sorted by the index sort will not try to compute this information and and will limit the collection to the first N documents per segment. Aggregations are not impacted and will continue to see every document even when the index sort matches the query sort and `track_total_hits` is false. Relates #6720

Sorted scroll search can use early termination when the index sort matches the scroll search sort. The optimization can be done after the first query (which still needs to collect all documents) by applying a query that only matches documents that are greater than the last doc retrieved in the previous request. Since the index is sorted, retrieving the list of documents that are greater than the last doc only requires a binary search on each segment. This change introduces this new query called `SortedSearchAfterDocQuery` and apply it when possible. Scrolls with this optimization will search all documents on the first request and then will early terminate each segment after $size doc for any subsequent requests. Relates elastic#6720

…25138) Sorted scroll search can use early termination when the index sort matches the scroll search sort. The optimization can be done after the first query (which still needs to collect all documents) by applying a query that only matches documents that are greater than the last doc retrieved in the previous request. Since the index is sorted, retrieving the list of documents that are greater than the last doc only requires a binary search on each segment. This change introduces this new query called `SortedSearchAfterDocQuery` and apply it when possible. Scrolls with this optimization will search all documents on the first request and then will early terminate each segment after $size doc for any subsequent requests. Relates #6720

jimczi · 2017-06-12T07:48:30Z

Index sorting is now fully supported in master.
The documentation for this feature can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html

We'll continue to add enhancements and features based on sorted indices but this issue can be closed.

jpountz added the adoptme label Jul 3, 2014

clintongormley assigned mikemccand Jul 4, 2014

clintongormley added v2.0.0 and removed adoptme labels Jul 4, 2014

mikemccand removed their assignment Jul 8, 2014

clintongormley mentioned this issue Jun 6, 2015

Support sorting of documents at merge time #8873

Closed

clintongormley added v2.0.0 v2.1.0 and removed v2.0.0-beta1 v2.0.0 labels Aug 13, 2015

jpountz added the stalled label Sep 4, 2015

clintongormley added v2.2.0 :Core/Infra/Core Core issues without another label and removed v2.1.0 labels Nov 20, 2015

spinscale added the v2.3.0 label Dec 23, 2015

clintongormley added v2.4.1 and removed v2.4.0 labels Aug 24, 2016

clintongormley added v2.4.2 and removed v2.4.1 labels Sep 23, 2016

clintongormley removed the v2.4.2 label Nov 6, 2016

clintongormley removed the stalled label Nov 25, 2016

clintongormley assigned jimczi Nov 25, 2016

jimczi mentioned this issue Apr 3, 2017

Add a setting which specifies a list of setting #23883

Closed

jimczi mentioned this issue Apr 11, 2017

Enable index-time sorting #24055

Merged

jimczi mentioned this issue Apr 28, 2017

Add support for early termination of search request #24398

Closed

jimczi mentioned this issue May 24, 2017

Automatically early terminate search query based on index sorting #24864

Merged

jimczi mentioned this issue Jun 8, 2017

Speed up sorted scroll when the index sort matches the search sort #25138

Merged

jimczi closed this as completed Jun 12, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing: index-time sorting #6720

Indexing: index-time sorting #6720

jpountz commented Jul 3, 2014

s1monw commented Jul 4, 2014

rmuir commented Jul 4, 2014

s1monw commented Jul 4, 2014

rmuir commented Jul 4, 2014

jpountz commented Jul 4, 2014

rjernst commented Jul 8, 2014

ofavre commented Aug 22, 2014

Vineeth-Mohan commented May 31, 2015

clintongormley commented Jun 6, 2015

jpountz commented Sep 4, 2015

otisg commented Sep 5, 2015

jpountz commented Sep 5, 2015

nik9000 commented Sep 7, 2015

jimczi commented Dec 7, 2015

JanJakes commented Jul 19, 2016

jpountz commented Jul 19, 2016

weiqiyiji commented Jan 17, 2017

mikemccand commented Jan 17, 2017

weiqiyiji commented Jan 17, 2017

makeyang commented Mar 27, 2017

jpountz commented Mar 27, 2017

makeyang commented Mar 31, 2017

jimczi commented Mar 31, 2017

jimczi commented Apr 19, 2017

jimczi commented Jun 12, 2017

Indexing: index-time sorting #6720

Indexing: index-time sorting #6720

Comments

jpountz commented Jul 3, 2014

s1monw commented Jul 4, 2014

rmuir commented Jul 4, 2014

s1monw commented Jul 4, 2014

rmuir commented Jul 4, 2014

jpountz commented Jul 4, 2014

rjernst commented Jul 8, 2014

ofavre commented Aug 22, 2014

Vineeth-Mohan commented May 31, 2015

clintongormley commented Jun 6, 2015

jpountz commented Sep 4, 2015

otisg commented Sep 5, 2015

jpountz commented Sep 5, 2015

nik9000 commented Sep 7, 2015

jimczi commented Dec 7, 2015

JanJakes commented Jul 19, 2016

jpountz commented Jul 19, 2016

weiqiyiji commented Jan 17, 2017

mikemccand commented Jan 17, 2017

weiqiyiji commented Jan 17, 2017

makeyang commented Mar 27, 2017

jpountz commented Mar 27, 2017

makeyang commented Mar 31, 2017

jimczi commented Mar 31, 2017

jimczi commented Apr 19, 2017

jimczi commented Jun 12, 2017