-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to skip non-competitive documents on field sort [LUCENE-9280] #10320
Comments
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) draft PR: apache/lucene-solr#1351 |
Michael Sokolov (@msokolov) (migrated from JIRA) This is interesting - could you provide a bit of motivation why this is desirable though? Since DocValues are stored in docid order, how will skipping work? I guess we know the min/max values for a block of docs, so could potentially skip over a block at a time? |
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) > Since DocValues are stored in docid order, how will skipping work? We have decided to do this similar how This is the most desirable for a case where docs were added sequentially with ever increasing values, e.g. a logging use case with an increasing date field. For example, if doc1 has a field1 value 1, doc2 – 2, doc3 – 3,.... doc1000000 – 1000000, and we needed to retrieve top 3 docs with smallest field1 values, currently we would iterate through all docs. With the proposed change, as soon as collect first 3 docs and set the bottom 3, the collector's iterator can be updated to remove all other docs as they are not competitive.
|
Adrien Grand (@jpountz) (migrated from JIRA) Woohoo. I can try to give a little more background about this feature. The question that usually comes first is the benefit of this approach compared to index sorting. Index sorting provides greater speedups than this change, but is limited to one field and one sort order. While it's possible to create multiple indexes that all have a different sort order, it's quite expensive in terms of resources, and creates new problems e.g. if you need all your indices to have a consistent view of the documents that exist in the index. Even if you only sort by a single field, this feature might be relevant. For instance with time series, it's common to sort data by descending timestamp, e.g. to live-tail events that match a filter, but you sometimes also need to sort in ascending order to find the first occurrence of a problem, which is often a useful piece of information for root cause analysis. With index sorting, sorting in reverse order of the index sort is a worst-case scenario. The approach to store min/max values per block in doc values would work too. Something I like with this approach is that it doesn't require adding new APIs to doc values, and that it doesn't only work with numbers. For instance, this approach could also be used to speed up sorting by geo-distance too (using the same principle as LatLonPointDistanceFeatureQuery) without requiring the introduction of a new type of doc values that supports multiple dimensions. |
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) @jpountz Thanks Adrien for the explanation about index sorting. It makes sense I was just wondering about this sentence of yours: " it's common to sort data by descending timestamp, e.g. to live-tail events that match a filter, but you sometimes also need to sort in ascending order". I guess this PR is not going to help for this case either, as segments and docs within them are processed in order they are written (in desc order), and we will not reach bottom values until the very end. But I was wondering if as a follow-up work to address this case, we can imitate the work of @jimczi that during a query pre-sorts segments based on a query sort field, and process segments in that sort order. |
Adrien Grand (@jpountz) (migrated from JIRA) I was just talking about it with Jim earlier today. I think this would make this change more complete indeed. |
David Smiley (@dsmiley) (migrated from JIRA) Cool! |
ASF subversion and git services (migrated from JIRA) Commit b0333ab in lucene-solr's branch refs/heads/master from Mayya Sharipova LUCENE-9280: Collectors to skip noncompetitive documents (#1351) Similar how scorers can update their iterators to skip non-competitive |
Michael McCandless (@mikemccand) (migrated from JIRA) Can this be resolved? Will we backport to 8.x? |
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) @mikemccand Thanks for checking it. I will resolve the issue. We have a separate issue for backporting to 8.x, as it requires introducing a new parameter for SortField. |
ASF subversion and git services (migrated from JIRA) Commit 7b12849 in lucene-solr's branch refs/heads/branch_8x from Mayya Sharipova #10424: Backport for field sort optimization (#1610) Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (#1351) Similar how scorers can update their iterators to skip non-competitive To enable sort optimization for numeric sort fields,
|
ASF subversion and git services (migrated from JIRA) Commit 9922067 in lucene-solr's branch refs/heads/master from Mayya Sharipova #10489 Skip docs with _doc sort and "after" (#1725)
Relates to LUCENE-9280 |
Adrien Grand (@jpountz) (migrated from JIRA) Closing after the 9.0.0 release |
Today collectors, once they collect enough docs, can instruct scorers to update their iterators to skip non-competitive documents. This is applicable only for a case when we need top docs by _score.
It would be nice to also have an ability to skip non-competitive docs when we need top docs sorted by other fields different from _score.
Migrated from LUCENE-9280 by Mayya Sharipova (@mayya-sharipova), resolved Jul 07 2020
Linked issues:
The text was updated successfully, but these errors were encountered: