-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indexing: index-time sorting #6720
Comments
I think this is an awesome feature for specific problems. Yet I think we have to make sure that this never ever corrupts an index during a merge. IMO we should restrict this to numeric fields and allow only |
why limit to numeric? whether a field is numeric or not is completely unrelated to this feature: sorry that just doesn't make sense. Since lucene 4.8 you can supply any arbitrary Sort here. |
I am super concerned about memory consumption so I should have rather said restricted to on-disk DV. I don't think it's unrelated |
my question was with regards to string fields though. I think those are ok too (as well as multiple-level sorts or whatever lucene has). If we want to limit it to DV to prevent traps in merging thats fine, but I think string is just as good as numeric. Separately, the real issue here is how well early termination will work with situations like aggregations. I don't yet fully understand how this is working, but it looks like it simply plugs into collector. so you would get wrong aggregation counts with early termination if we aren't careful. |
+1 on allowing string doc values and multi-level sorts. We don't need to have them from the first iteration but having them eventually would be nice. Then we could early-terminate on any prefix on the sort that is configured at index time. Even without aggregations, simple things like the total number of hits could only be approximated. I guess there would need to be a per-query flag to turn early termination on. |
This sounds like the PageRank use case, but there is another as well. Sorting can be very useful for compression, to allow "clumps" of documents to be near each other (thus producing smaller deltas in postings). For example, let's say we have songs lyrics as our documents, and we sort by genre. If we restrict the search to a specific genre, we will quickly narrow into a small portion of the postings list, with small deltas. Also, if we do a search across all genres, but for a term which is highly correlated with a genre (e.g. "cowboy" for "Country"), then again we will have good compression and search speed for that term. My point is just that you don't need early termination for this to be useful. It can help speed up queries if you just understand your data and the common types of queries you run. |
Algolia amazing speed is achieved through index-time ordering, and they do support typo-tolerance as well. |
Looking forward for this feature. |
Comment copied over from #8873: Documents of different types tend to have different fields, which can lead to problems of sparsity when documents are stored in the order that they were indexed. If we sort documents by type, then missing values (eg in doc values or norms) can be represented very efficiently. On merge, documents should be sorted by type (assuming there is more than one type). Additionally, for certain use cases, it can make sense to apply a secondary sort (eg on timestamp). This secondary level should be customizable per index, and should probably be dynamically updatable (although it would only have effect on later merges). Relates to #8870 |
We probably want to have https://issues.apache.org/jira/browse/LUCENE-6766 first. |
@jpountz would this help in the typical ELK case where one indexes logs (older first, then newer ones) and typically short in the opposite order - first new ones, then old ones? In other words, with a typical ELK case where logs are being indexed would the natural indexing order already result in ideally sorted index, or would one have to run something (whatever LUCENE-6766 and this ES issue expose) to inverse the sort order of such an index? |
I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated. In my opinion, this would be more useful for pure search use-cases that only need to fetch the top docs and have a low to moderate ingestion rate. |
I suspect it'd be faster to answer "find me the last 10 things that match this query" which is pretty useful. You'd have to not care about the count of total matches and I'm not sure if that outweighs the index time cost. I don't know this feature well so I can speak to what that index time cost amounts to. |
After reading the description I though it would be pretty easy to add the feature but then I realized that it would break the blockjoin/nested support. Seems like we would need another sorting merge policy which takes the nested document into account... |
Any plans/updates on this issue? Using index time sorting would have huge performance impacts on queries like "give me 10 items with the highest rating" when having millions of them. Seems like Algolia is using exactly this approach: https://www.algolia.com/doc/faq/index-configuration/how-to-sort-the-results-with-a-specific-attribute |
We are slowly making progress: index sorting is going to be better integrated into Lucene in the upcoming 6.2 release https://issues.apache.org/jira/browse/LUCENE-6766 so when 6.2 is out we can start working on the integration. This will take time so don't expect it to be implemented and automatically terminate queries early etc. soon, but there is definitely progress being made. |
Hey, guys, is there any progress? Currently I need index-time sorting in my product, and I wonder if I can wait util you guys implement this feature or I have to do it myself first. |
There has been good progress on the Lucene side, including a change @jimczi just pushed to Lucene 6.x (https://issues.apache.org/jira/browse/LUCENE-7579) for its future 6.5.0 release, to give a big boost to indexing performance with index-time sorting. You can see the impact of that change in the nightly sparse Lucene benchmarks: it's annotation V in this chart: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput But we still have plenty of work to do to expose sorted indices in ES. |
@mikemccand Ok, I'll watch this issue first. Thanks! |
@mikemccand any branch working on this so we can take a look at it ? |
There is no branch yet. |
sorry to bother to propose some thoughts on this issue:
|
@makeyang thanks. |
This change adds an index setting to define how the documents should be sorted inside each Segment. It allows any numeric, date, boolean or keyword field inside a mapping to be used to sort the index on disk. It is not allowed to use a `nested` fields inside an index that defines an index sorting since `nested` fields relies on the original sort of the index. This change does not add early termination capabilities in the search layer. This will be added in a follow up. Relates #6720
Index time sorting has landed in master: There are still missing pieces that we need to add in order to make index sorting useful:
|
This is a spin off for elastic#24398. This commit refactors the query phase in order to be able to automatically detect queries that can be early terminated. If the index sort matches the query sort, the top docs collection is early terminated on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector. This change also adds a new parameter to the search request called `track_total_hits`. It indicates if the total number of hits that match the query should be tracked. If false, queries sorted by the index sort will not try to compute this information and will limit the collection to the first N documents per segment. Aggregations are not impacted and will continue to see every document even when the index sort matches the query sort and `track_total_hits` is false. Relates elastic#6720
…4864) This commit refactors the query phase in order to be able to automatically detect queries that can be early terminated. If the index sort matches the query sort, the top docs collection is early terminated on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector. This change also adds a new parameter to the search request called `track_total_hits`. It indicates if the total number of hits that match the query should be tracked. If false, queries sorted by the index sort will not try to compute this information and and will limit the collection to the first N documents per segment. Aggregations are not impacted and will continue to see every document even when the index sort matches the query sort and `track_total_hits` is false. Relates #6720
Sorted scroll search can use early termination when the index sort matches the scroll search sort. The optimization can be done after the first query (which still needs to collect all documents) by applying a query that only matches documents that are greater than the last doc retrieved in the previous request. Since the index is sorted, retrieving the list of documents that are greater than the last doc only requires a binary search on each segment. This change introduces this new query called `SortedSearchAfterDocQuery` and apply it when possible. Scrolls with this optimization will search all documents on the first request and then will early terminate each segment after $size doc for any subsequent requests. Relates elastic#6720
…25138) Sorted scroll search can use early termination when the index sort matches the scroll search sort. The optimization can be done after the first query (which still needs to collect all documents) by applying a query that only matches documents that are greater than the last doc retrieved in the previous request. Since the index is sorted, retrieving the list of documents that are greater than the last doc only requires a binary search on each segment. This change introduces this new query called `SortedSearchAfterDocQuery` and apply it when possible. Scrolls with this optimization will search all documents on the first request and then will early terminate each segment after $size doc for any subsequent requests. Relates #6720
Index sorting is now fully supported in master. We'll continue to add enhancements and features based on sorted indices but this issue can be closed. |
Lucene has index-time sorting and early query termination capabilities. It would be nice to integrate it into Elasticsearch in order to speed up queries whose sort order matches the index order. This presentation contains some information about these features and their implementation in Lucene.
The text was updated successfully, but these errors were encountered: