Add a post-collection hook to LeafCollector. #12380

jpountz · 2023-06-21T11:20:43Z

This adds `LeafCollector#finish` as a per-segment post-collection hook. While it was already possible to do this sort of thing on top of the collector API, a downside was that the last leaf had to be post-collected in the current thread instead of using the executor, which is a missed opportunity for making queries concurrent.

Closes #12375
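
To make the motivation concrete, here is a minimal, self-contained sketch. `SimpleLeafCollector`, `SumPerLeaf` and `FinishHookDemo` are hypothetical stand-ins rather than Lucene classes; only the `collect`/`finish` shape mirrors the hook described above. Each leaf is collected and finished inside the same executor task, so the post-collection work of every leaf, including the last one, can run off the calling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a per-segment collector; not Lucene's LeafCollector.
interface SimpleLeafCollector {
  void collect(int doc);

  // Per-segment post-collection hook, called once after the last collect().
  default void finish() {}
}

// Buffers doc IDs while collecting and performs the reduction in finish().
class SumPerLeaf implements SimpleLeafCollector {
  private final List<Integer> buffered = new ArrayList<>();
  long sum = -1; // only meaningful after finish()

  @Override
  public void collect(int doc) {
    buffered.add(doc);
  }

  @Override
  public void finish() {
    long s = 0;
    for (int doc : buffered) {
      s += doc;
    }
    sum = s;
  }
}

class FinishHookDemo {
  // Collects every leaf on the executor; finish() runs in the same task as collect().
  static long searchConcurrently(int[][] leaves) {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (int[] leafDocs : leaves) {
        Callable<Long> task =
            () -> {
              SumPerLeaf leafCollector = new SumPerLeaf();
              for (int doc : leafDocs) {
                leafCollector.collect(doc);
              }
              leafCollector.finish(); // post-collection stays off the caller thread
              return leafCollector.sum;
            };
        futures.add(executor.submit(task));
      }
      long total = 0;
      for (Future<Long> future : futures) {
        try {
          total += future.get(); // cheap merge of the per-leaf results
        } catch (InterruptedException | ExecutionException e) {
          throw new IllegalStateException(e);
        }
      }
      return total;
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    // prints 15
    System.out.println(searchConcurrently(new int[][] {{0, 1, 2}, {3, 4}, {5}}));
  }
}
```

Without a per-leaf hook, the reduction done here in `finish()` would have to happen when the next leaf starts or after the whole search, and for the last leaf that means the calling thread.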

zhaih

Overall LGTM, thanks!

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java

lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java

zhaih · 2023-06-28T05:13:44Z

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java

@@ -749,6 +749,9 @@ protected void search(List<LeafReaderContext> leaves, Weight weight, Collector c
          partialResult = true;
        }
      }
+      // Note: this is called if collection ran successfully, including the above special cases of
+      // CollectionTerminatedException and TimeExceededException, but no other exception.
+      leafCollector.finish();


I wonder whether it is worth passing in the exception, if any, in case of early termination, but I can't think of a concrete example of how it might be useful right now (maybe users want a faster finish step when collection was terminated early by a timeout?). Maybe we can add it later if there's a real need?

I can't think of a use-case either. Another argument could be that CollectionTerminatedException is only one way to skip hits; LeafCollector#competitiveIterator and Scorer#setMinCompetitiveScore are other ones. Why would we give more information to finish() for one way of skipping and not for the others?

~~One thing I notice in the case there is no doc of interest, it won't be called (see continue statement), I wonder if we should call it even in that case?~~ we are not building a leaf collector in that case, sorry.

This could be an opportunity for capturing statistics about how often time-limitation is applied?
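
The contract discussed in this thread can be sketched in isolation. The types below (`TerminatedSignal`, `TimeExceededSignal`, `RecordingLeaf`, `SearchLoopSketch`) are hypothetical stand-ins for CollectionTerminatedException, TimeExceededException, a leaf collector and the per-leaf part of the search loop; this is an illustration of the control flow in the hunk above, not Lucene's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the two "expected" early-exit exceptions.
class TerminatedSignal extends RuntimeException {}

class TimeExceededSignal extends RuntimeException {}

// Hypothetical leaf collector that throws a configured exception once it has
// collected throwAfter docs, and records whether finish() was called.
class RecordingLeaf {
  final List<Integer> docs = new ArrayList<>();
  boolean finished = false;
  private final int throwAfter;
  private final RuntimeException toThrow;

  RecordingLeaf(int throwAfter, RuntimeException toThrow) {
    this.throwAfter = throwAfter;
    this.toThrow = toThrow;
  }

  void collect(int doc) {
    if (toThrow != null && docs.size() == throwAfter) {
      throw toThrow;
    }
    docs.add(doc);
  }

  void finish() {
    finished = true;
  }
}

class SearchLoopSketch {
  // Returns true when the result is partial because time was exceeded.
  static boolean collectLeaf(RecordingLeaf leaf, int[] docs) {
    boolean partialResult = false;
    try {
      for (int doc : docs) {
        leaf.collect(doc);
      }
    } catch (TerminatedSignal e) {
      // collection was terminated prematurely; still counts as a successful run
    } catch (TimeExceededSignal e) {
      partialResult = true;
    }
    // Reached after normal completion and after the two cases above, but not
    // when any other exception propagates out of the try block.
    leaf.finish();
    return partialResult;
  }
}
```

In this sketch `finish()` receives no information about why collection stopped, which matches the choice made in the thread.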

msokolov

looks good thanks, just one or two little questions

msokolov · 2023-06-28T12:06:31Z

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java

@@ -749,6 +749,9 @@ protected void search(List<LeafReaderContext> leaves, Weight weight, Collector c
          partialResult = true;
        }
      }
+      // Note: this is called if collection ran successfully, including the above special cases of
+      // CollectionTerminatedException and TimeExceededException, but no other exception.
+      leafCollector.finish();


This could be an opportunity for capturing statistics about how often time-limitation is applied?

msokolov · 2023-06-28T12:09:48Z

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java

      // NOTE: this also clears the priorityQueue:
      for (SuggestScoreDoc hit : priorityQueue.getResults()) {
        pendingResults.add(hit);
      }
+
+      // Deduplicate all hits: we already dedup'd efficiently within each segment by


any particular reason to change the order of operations here?

msokolov · 2023-06-28T12:10:47Z

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java

-
-      // Deduplicate all hits: we already dedup'd efficiently within each segment by
-      // truncating the FST top paths search, but across segments there may still be dups:
-      seenSurfaceForms.clear();


oh I see we did it both ways. Probably makes no difference? Still superstitiously I would always want to clear/delete things last just in case...

+1 I liked moving the clear last better

msokolov · 2023-06-28T12:12:57Z

lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingCollector.java

@@ -49,7 +50,9 @@ public LeafCollector getLeafCollector(LeafReaderContext context) throws IOExcept
    assert context.docBase >= previousLeafMaxDoc;
    previousLeafMaxDoc = context.docBase + context.reader().maxDoc();

+    assert hasFinishedCollectingPreviousLeaf;


it's a pity we can't assert that we finished the final leaf

Thanks for your comment, I wanted to look into that and then forgot. It should be doable via AssertingIndexSearcher.
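
One possible shape for such a check, as a sketch only: `TrackedLeaf`, `TrackingCollector` and `TrackingSearcher` are hypothetical names, and explicit `AssertionError`s are used instead of `assert` so that the checks run without `-ea`. The collector can verify the previous leaf whenever the next one is requested, but the final leaf has no successor, so the component driving the search has to trigger that last check.

```java
// Hypothetical stand-ins; not the classes from Lucene's test framework.
class TrackedLeaf {
  boolean finished = false;

  void finish() {
    finished = true;
  }
}

class TrackingCollector {
  private TrackedLeaf previousLeaf = null;

  // Mirrors the "previous leaf must be finished" check made when the next leaf starts.
  TrackedLeaf getLeafCollector() {
    if (previousLeaf != null && !previousLeaf.finished) {
      throw new AssertionError("previous leaf was not finished");
    }
    previousLeaf = new TrackedLeaf();
    return previousLeaf;
  }

  // No later getLeafCollector() call exists for the final leaf, so whoever
  // drives the search has to call this once the search is over.
  void checkLastLeafFinished() {
    if (previousLeaf != null && !previousLeaf.finished) {
      throw new AssertionError("last leaf was not finished");
    }
  }
}

class TrackingSearcher {
  // Hypothetical driver: visits leafCount leaves and can "forget" the last finish().
  static void search(TrackingCollector collector, int leafCount, boolean finishLast) {
    for (int i = 0; i < leafCount; i++) {
      TrackedLeaf leaf = collector.getLeafCollector();
      if (i < leafCount - 1 || finishLast) {
        leaf.finish();
      }
    }
    collector.checkLastLeafFinished();
  }
}
```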

msokolov · 2023-06-28T12:14:13Z

lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingLeafCollector.java

+
+  @Override
+  public void finish() throws IOException {
+    assert finishCalled == false;


Did we previously disallow re-use of LeafCollectors? If not, this could break someone

Yes, it's disallowed by design since LeafCollector#collect must collect doc IDs in doc ID order.
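
A small sketch of why re-use is already ruled out, under the assumption that doc IDs passed to a leaf collector are segment-relative and therefore restart for each segment. `StrictLeaf` and `StrictLeafDemo` are hypothetical names, and explicit `AssertionError`s replace `assert` so the checks run without `-ea`.

```java
// Hypothetical leaf collector enforcing doc ID order and a single finish() call.
class StrictLeaf {
  private int lastDoc = -1;
  private boolean finishCalled = false;
  int collected = 0;

  void collect(int doc) {
    if (finishCalled) {
      throw new AssertionError("collect() after finish()");
    }
    if (doc <= lastDoc) {
      throw new AssertionError("doc IDs must increase: " + doc + " after " + lastDoc);
    }
    lastDoc = doc;
    collected++;
  }

  void finish() {
    if (finishCalled) {
      throw new AssertionError("finish() called twice");
    }
    finishCalled = true;
  }
}

class StrictLeafDemo {
  // Returns true if running the action trips one of the checks above.
  static boolean rejects(Runnable action) {
    try {
      action.run();
      return false;
    } catch (AssertionError e) {
      return true;
    }
  }
}
```

Feeding a second segment into the same instance would start again at a low doc ID and fail the ordering check even before `finish()` comes into play.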

cpoerschke · 2023-09-28T16:06:34Z

lucene/suggest/src/java/org/apache/lucene/search/suggest/document/SuggestIndexSearcher.java

+        LeafCollector leafCollector = collector.getLeafCollector(context);
        try {


Comparing this to the IndexSearcher code (lucene/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java, lines 710 to 740 in 6d764c3):

    final LeafCollector leafCollector;
    try {
      leafCollector = collector.getLeafCollector(ctx);
    } catch (
        @SuppressWarnings("unused")
        CollectionTerminatedException e) {
      // there is no doc of interest in this reader context
      // continue with the following leaf
      continue;
    }
    BulkScorer scorer = weight.bulkScorer(ctx);
    if (scorer != null) {
      if (queryTimeout != null) {
        scorer = new TimeLimitingBulkScorer(scorer, queryTimeout);
      }
      try {
        scorer.score(leafCollector, ctx.reader().getLiveDocs());
      } catch (
          @SuppressWarnings("unused")
          CollectionTerminatedException e) {
        // collection was terminated prematurely
        // continue with the following leaf
      } catch (
          @SuppressWarnings("unused")
          TimeLimitingBulkScorer.TimeExceededException e) {
        partialResult = true;
      }
    }
    // Note: this is called if collection ran successfully, including the above special cases of
    // CollectionTerminatedException and TimeExceededException, but no other exception.
    leafCollector.finish();

I wonder if the getLeafCollector call should move inside the try block here too?

    final LeafCollector leafCollector;
    try {
      leafCollector = collector.getLeafCollector(context);
      ...
    } catch (CollectionTerminatedException e) {
      ...
    }
    if (leafCollector != null) leafCollector.finish();

Trying to remember what was on my mind at the time of the change, I think I wanted to keep the logic simple, since unlike IndexSearcher which may run any Collector, here it may only be a TopSuggestDocsCollector, which never throws a CollectionTerminatedException. I'm ok with moving the getLeafCollector call under the try block though, if you open a PR I'll be happy to approve it.

#12609 opened
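
A compilable sketch of the idea suggested in this thread, using hypothetical stand-ins (`EarlyExit` for CollectionTerminatedException, `FinishableLeaf` for the leaf collector, `GuardedLeafLoop` for the per-leaf loop). It is not the change made in #12609. The local is non-final and initialised to `null`, since Java's definite-assignment rules would reject reading a `final` local after the catch block when it is only assigned inside the `try`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for CollectionTerminatedException.
class EarlyExit extends RuntimeException {}

// Hypothetical leaf collector that counts docs and records finish().
class FinishableLeaf {
  int collected = 0;
  boolean finished = false;

  void collect(int doc) {
    collected++;
  }

  void finish() {
    finished = true;
  }
}

class GuardedLeafLoop {
  // Returns the finished leaf, or null if no leaf collector could be obtained.
  static FinishableLeaf collectLeaf(Supplier<FinishableLeaf> leafFactory, int[] docs) {
    FinishableLeaf leaf = null;
    try {
      leaf = leafFactory.get(); // may signal "no doc of interest in this leaf"
      for (int doc : docs) {
        leaf.collect(doc);
      }
    } catch (EarlyExit e) {
      // thrown either while obtaining the leaf (leaf == null) or while collecting
    }
    if (leaf != null) {
      leaf.finish(); // only finish a leaf that was actually created
    }
    return leaf;
  }
}
```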

jpountz added this to the 9.8.0 milestone Jun 21, 2023

jpountz requested a review from msokolov June 21, 2023 11:20

tidy

e912496

zhaih reviewed Jun 28, 2023

jpountz added 2 commits June 28, 2023 07:18

Merge branch 'main' into post-collection-hook

2333f6b

Review feedback.

6fdb597

zhaih approved these changes Jun 28, 2023

msokolov reviewed Jun 28, 2023

jpountz added 4 commits June 28, 2023 15:58

Check last leaf.

2207cd3

Merge branch 'main' into post-collection-hook

aea87d4

Merge branch 'main' into post-collection-hook

270d8d4

CHANGES

4e4edcf

jpountz merged commit 8811f31 into apache:main Jun 30, 2023

jpountz deleted the post-collection-hook branch June 30, 2023 13:19

reta mentioned this pull request Jul 14, 2023

Update Apache Lucene to 9.8.0-snapshot-4373c3b opensearch-project/OpenSearch#8668

Merged

6 tasks

nknize mentioned this pull request Jul 28, 2023

[BUG] CompletionSuggestSearchIT.testSkipDuplicates is flaky opensearch-project/OpenSearch#8963

Merged

6 tasks

sohami mentioned this pull request Aug 17, 2023

Evaluate using LeafCollector::finish API in lucene for aggregation postCollection processing opensearch-project/OpenSearch#9411

Open

benwtrent mentioned this pull request Aug 22, 2023

[Lucene 9.8-Snapshot] Suggest phase search failures elastic/elasticsearch#98738

Closed

cpoerschke reviewed Sep 28, 2023

cpoerschke mentioned this pull request Sep 29, 2023

SuggestIndexSearcher.suggest catches any CollectionTerminatedException (theoretically) thrown by getLeafCollector #12609

Merged

sarthakn7 mentioned this pull request Mar 1, 2024

Lucene 9.8 Yelp/nrtsearch#624

Merged
