Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to skip non-competitive documents on field sort [LUCENE-9280] #10320

Closed
asfimport opened this issue Mar 16, 2020 · 13 comments
Closed

Comments

@asfimport
Copy link

asfimport commented Mar 16, 2020

Today collectors, once they collect enough docs, can instruct scorers to update their iterators to skip non-competitive documents. This is applicable only for a case when we need top docs by _score.

It would be nice to also have an ability to skip non-competitive docs when we need top docs sorted by other fields different from _score.


Migrated from LUCENE-9280 by Mayya Sharipova (@mayya-sharipova), resolved Jul 07 2020
Linked issues:

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

draft PR: apache/lucene-solr#1351

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

This is interesting - could you provide a bit of motivation why this is desirable though? Since DocValues are stored in docid order, how will skipping work? I guess we know the min/max values for a block of docs, so could potentially skip over a block at a time?

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

> Since DocValues are stored in docid order, how will skipping work? 

We have decided to do this similar how LongDistanceFeatureQuery works.  The initial iterator for a collector is indeed a docValues, but then as we set the bottom value in the comparator, the iterator is updated from PointValues to include only docs that are lower than bottom value (in case of asc sort). 

This is the most desirable for a case where docs were added sequentially with ever increasing values, e.g. a logging use case with an increasing date field. For example, if doc1 has a  field1 value 1, doc2 – 2, doc3 – 3,.... doc1000000 – 1000000, and we needed to retrieve top 3 docs with smallest field1 values,  currently we would iterate through all docs. With the proposed change, as soon as collect first 3 docs and set the bottom 3, the collector's iterator can be updated to remove all other docs as they are not competitive. 

 

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Woohoo. I can try to give a little more background about this feature. The question that usually comes first is the benefit of this approach compared to index sorting. Index sorting provides greater speedups than this change, but is limited to one field and one sort order. While it's possible to create multiple indexes that all have a different sort order, it's quite expensive in terms of resources, and creates new problems e.g. if you need all your indices to have a consistent view of the documents that exist in the index. Even if you only sort by a single field, this feature might be relevant. For instance with time series, it's common to sort data by descending timestamp, e.g. to live-tail events that match a filter, but you sometimes also need to sort in ascending order to find the first occurrence of a problem, which is often a useful piece of information for root cause analysis. With index sorting, sorting in reverse order of the index sort is a worst-case scenario.

The approach to store min/max values per block in doc values would work too. Something I like with this approach is that it doesn't require adding new APIs to doc values, and that it doesn't only work with numbers. For instance, this approach could also be used to speed up sorting by geo-distance too (using the same principle as LatLonPointDistanceFeatureQuery) without requiring the introduction of a new type of doc values that supports multiple dimensions.

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@jpountz  Thanks Adrien for the explanation about index sorting. It makes sense

I was just wondering about this sentence of yours: " it's common to sort data by descending timestamp, e.g. to live-tail events that match a filter, but you sometimes also need to sort in ascending order".  I guess this PR is not going to help for this case either, as segments and docs within them are processed in order they are written (in desc order), and we will not reach bottom values until the very end. But I was wondering if as a follow-up work to address this case, we can imitate the work of @jimczi that during a query pre-sorts segments based on a query sort field, and process segments in that sort order.

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

I was just talking about it with Jim earlier today. I think this would make this change more complete indeed.

@asfimport
Copy link
Author

David Smiley (@dsmiley) (migrated from JIRA)

Cool!

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit b0333ab in lucene-solr's branch refs/heads/master from Mayya Sharipova
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b0333ab

LUCENE-9280: Collectors to skip noncompetitive documents (#1351)

Similar how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competive documents.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Can this be resolved?  Will we backport to 8.x?

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@mikemccand Thanks for checking it. I will resolve the issue. We have a separate issue for backporting to 8.x, as it requires introducing a new parameter for SortField.

@asfimport
Copy link
Author

asfimport commented Jul 31, 2020

ASF subversion and git services (migrated from JIRA)

Commit 7b12849 in lucene-solr's branch refs/heads/branch_8x from Mayya Sharipova
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7b12849

#10424: Backport for field sort optimization (#1610)

Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (#1351)

Similar how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competive documents.

To enable sort optimization for numeric sort fields,
the following needs to be done:

  1. the field should be indexed with both doc_values and points, that
    must have the same field name and same data
  2. SortField#setCanUsePoints must be set
  3. totalHitsThreshold should not be set to max value.

@asfimport
Copy link
Author

asfimport commented Sep 8, 2020

ASF subversion and git services (migrated from JIRA)

Commit 9922067 in lucene-solr's branch refs/heads/master from Mayya Sharipova
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9922067

#10489 Skip docs with _doc sort and "after" (#1725)

  • Enhance DocComparator to provide an iterator over competitive
    documents when searching with "after". This iterator can quickly position
    on the desired "after" document skipping all documents and segments before
    "after".

  • Redesign numeric comparators to provide skipping functionality
    by default.

Relates to LUCENE-9280

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant