-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9449 Skip docs with _doc sort and "after" #1725
LUCENE-9449 Skip docs with _doc sort and "after" #1725
Conversation
Enhance DocComparator to provide an iterator over competitive documents when searching with "after" FieldDoc. This iterator can quickly position on the desired "after" document, and skip all documents before "after" or even whole segments that contain only documents before "after". Related to LUCENE-9280
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a first pass and left some comments but it looks great. I like the fact that the optimization can be done entirely outside of the TopFieldCollector
.
lucene/core/src/java/org/apache/lucene/search/FilteringDocLeafComparator.java
Outdated
Show resolved
Hide resolved
* @return comparator wrapped as a filtering comparator or the original comparator if the filtering functionality | ||
* is not implemented for it | ||
*/ | ||
public static FieldComparator<?> wrapToFilteringComparator(FieldComparator<?> comparator, boolean reverse, boolean singleSort) { | ||
public static FieldComparator<?> wrapToFilteringComparator(FieldComparator<?> comparator, boolean reverse, boolean singleSort, | ||
boolean hasAfter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to add the hasAfter
? Can we check the if the topValue
in the DocComparator is greater than 0 instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At that moment topValue is not set yet, it will be set later in the constructor of PagingFieldCollector
.
*/ | ||
public static <T extends FieldValueHitQueue.Entry> FieldValueHitQueue<T> create(SortField[] fields, int size, | ||
boolean filterNonCompetitiveDocs) { | ||
boolean filterNonCompetitiveDocs, boolean hasAfter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid adding hasAfter
here ? See my comment below.
@@ -121,7 +121,7 @@ protected boolean lessThan(final Entry hitA, final Entry hitB) { | |||
} | |||
|
|||
// prevent instantiation and extension. | |||
private FieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs) { | |||
private FieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs, boolean hasAfter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that hasAfter
is really needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point of time, topValue for comparators is not set yet, that's why we need hasAfter
.
As an alternative to this implementation, we can pass FieldDoc after
to FieldValueHitQueue.create
and setTopValue during FieldValueHitQueue
creation.
@@ -95,8 +95,8 @@ protected boolean lessThan(final Entry hitA, final Entry hitB) { | |||
*/ | |||
private static final class MultiComparatorsFieldValueHitQueue<T extends FieldValueHitQueue.Entry> extends FieldValueHitQueue<T> { | |||
|
|||
public MultiComparatorsFieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs) { | |||
super(fields, size, filterNonCompetitiveDocs); | |||
public MultiComparatorsFieldValueHitQueue(SortField[] fields, int size, boolean filterNonCompetitiveDocs, boolean hasAfter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that hasAfter
is really needed here.
* This comparator is used when there is sort by _doc asc together with "after" FieldDoc. | ||
* The comparator provides an iterator that can quickly skip to the desired "after" document. | ||
*/ | ||
public class FilteringDocLeafComparator implements FilteringLeafFieldComparator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename to AfterDocLeafComparator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like AfterDocLeafComparator
, but I renamed to FilteringAfterDocLeafComparator
for consistency with all other filtering comparators. Please let me know if you still like it to be renamed
@jimczi Thank you for the initial feedback. I tried to address it, can you please continue the review |
@jimczi Thank you for the initial feedback. I have tried to address it. Please continue to review, when you have time. I have caught up with @jimczi offline, and his main comment was that we need to redesign all comparators is such a way that they all provide skipping functionality by default (without wrapping them into FilteringComparator and FilteringLeafComparator). While this be a target of following PRs, the goal of this PR is do this only for So addressing @jimczi's feedback, in this PR:
|
8616330
to
fabfca5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mayya-sharipova . I left two comments that I find important, one regarding the introduction of new functions in the filtering leaf comparator. I don't think they are needed. And the other regarding the early termination of queries sorted by doc id. We don't need to visit more documents if the queue is full and the the hits threshold has been reached.
lucene/core/src/java/org/apache/lucene/search/FieldComparator.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/FilteringLeafFieldComparator.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
public void setBottom(int slot) { | ||
bottom = docIDs[slot]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to early terminate here if the total hits threshold has been reached. If it's not reached yet, we can early terminate later in setHitsThresholdReached
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any query sorted by doc id can early terminate after N matches. That's an important aspect of the optimization since it can be handled by the hits threshold transparently. If there is no after
value, the threshold should be an upper bound of the number of document that we will collect in the comparator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Thank you for your comment. I would like to clarify something:
can early terminate after N matches
What's the way to terminate after N matches here? Is it to update an iterator to an empty iterator?
Isn't this termination already handled in TopFieldCollector with the code around canEarlyTerminate
?
Is the plan to remove the code around canEarlyTerminate
in TopFieldCollector
? Should we do after we also handle the case on early termination with the same index and query sort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the way to terminate after N matches here? Is it to update an iterator to an empty iterator?
I think so, yes. Updating to an empty iterator is what we do for constant score queries for instance.
Isn't this termination already handled in TopFieldCollector with the code around canEarlyTerminate?
The code is only for sorted index and while I think that we should move this code in the sort comparator, I agree that it's out of the scope of this PR. Early termination should be handled in the field comparator so that we don't need to add new logic in the main collector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Thanks for the feedback. I have added an early termination after N matches for DocComparator in 21de242.
The code is only for sorted index
We also have an early termination in a collector based on _doc order. I can remove this code later once I can make sure that all queries use DefaultBulkScorer
that uses collector's iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sorry I forgot that we added the early termination logic for _doc order in the collector. Maybe that's ok to leave it as it is then. We can revise after we ensure that all bulk scorer uses the collector's iterator ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi That sounds good to me. Thank you for the feedback. Do you have any further comments for this PR?
lucene/core/src/java/org/apache/lucene/search/FilteringLeafFieldComparator.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
Outdated
Show resolved
Hide resolved
- Redesign numeric comparators so by default they provide skipping functionality. This resuled in moving comparators to a separate package. - Remove unnecessary filtering comparator classes, as by default comparators provide skipping functionality - Remove unncessary checks in TopFieldCollector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two questions but it's getting close. Thanks for iterating on this @mayya-sharipova !
/** Creates a new comparator based on document ids for {@code numHits} */ | ||
public DocComparator(int numHits, boolean reverse) { | ||
this.docIDs = new int[numHits]; | ||
// skipping functionality is enabled if we are sorting by _doc in asc order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be activated only if the comparator is used as the primary sort ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Thanks Jim. We also had an implicit protection for a primary sort in MultiLeafFieldComparator.java, but I agree it is good to explicitly indicate a primary sort.
Addressed in 485fe4f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also had an implicit protection for a primary sort
Ok I see so the expectation is that setHitsThresholdReached
is only called on the primary sort and we use it as a signal to enable skipping.
Addressed in 485fe4f
I was thinking that we could use the information exposed in SortField#getComparator
, we don't need to add a new callback. However, I think we can rely on what you describe above, setHitsThresholdReached
and competitiveIterator
are the callbacks that we need to enable the optimization. Sorry for the back and forth but I we probably don't need 485fe4f ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Thanks for the feedback.
I was thinking that we could use the information exposed in SortField#getComparator, we don't need to add a new callback
Good point, I haven't noticed that before, I've refactored the code to use sortPos
from this function. Addressed in 92fa246
However, I think we can rely on what you describe above, setHitsThresholdReached and competitiveIterator are the callbacks that we need to enable the optimization. ... we probably don't need 485fe4f
Doing sort optimization only on primary sort in MultiLeafFieldComparator
is implicit and some refactoring of it may bring wrong result. So I think it is good to have extra protection
private void updateIterator() { | ||
if (enableSkipping == false || hitsThresholdReached == false) return; | ||
if (bottomValueSet) { | ||
// since we've collected top N matches, we can early terminate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining that early termination is already implemented in the collector but we'll remove in a follow up ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in 485fe4f
Add a note to remove early termination in collector
This reverts commit 485fe4f.
Add a note to remove early termination in collector
1c79ae6
to
92fa246
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments around testing but the changes look good to me.
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
Outdated
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
Outdated
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
Outdated
Show resolved
Hide resolved
@jimczi Thanks for the review so far, I am wondering if you have any further comments? |
- Enhance DocComparator to provide an iterator over competitive documents when searching with "after". This iterator can quickly position on the desired "after" document skipping all documents and segments before "after". - Redesign numeric comparators to move to separate package. Backport for #LUCENE-9449
Fix bug how iterator with skipping functionality advances and produces docs Relates to apache#1725
Fix bug how iterator with skipping functionality advances and produces docs Relates to #1725
Some collectors provide iterators that can efficiently skip non-competitive docs. When using DefaultBulkScorer#score function we create a conjunction of scorerIterator and collectorIterator. As collectorIterator always starts from a docID = -1, and for creation of conjunction iterator we need all of its sub-iterators to be on the same doc, the creation of conjuction iterator will fail if scorerIterator has already been advanced to some other document. This patch ensures that we create conjunction between scorerIterator and collectorIterator only if scorerIterator has not been advanced yet. Relates to apache#1725 Relates to apache#1937
Enhance DocComparator to provide an iterator over competitive documents
when searching with "after" FieldDoc.
This iterator can quickly position on the desired "after" document,
and skip all documents before "after" or even whole segments
that contain only documents before "after".
Related to LUCENE-9280