Move Aggregator#buildTopLevel() to search worker thread. #98715
Pinging @elastic/es-analytics-geo (Team:Analytics)
# Conflicts:
#	server/src/main/java/org/elasticsearch/search/profile/query/InternalProfileCollector.java
#	server/src/main/java/org/elasticsearch/search/query/QueryPhaseCollector.java
# Conflicts:
#	test/framework/src/main/java/org/elasticsearch/search/aggregations/AggregatorTestCase.java
I think in order to fix the timeout exception for search cancellation issue that we see with the
Subject: [PATCH] search-worker-overwrites
---
Index: server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java b/server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java
--- a/server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java (revision 5ac1b3057cff072e2d82e932697215ea022be7bb)
+++ b/server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java (date 1694187992273)
@@ -54,10 +54,12 @@
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
+import java.util.Map;
import java.util.Objects;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.concurrent.CancellationException;
+import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.Future;
@@ -489,7 +491,12 @@
// otherwise the state of the aggregation might be undefined and running post collection
// might result in an exception
if (success || timeExceeded) {
- doAggregationPostCollection(collector);
+ try {
+ timeoutOverwrites.put(Thread.currentThread(), true);
+ doAggregationPostCollection(collector);
+ } finally {
+ timeoutOverwrites.remove(Thread.currentThread());
+ }
}
}
}
@@ -505,8 +512,12 @@
return timeExceeded;
}
+ private final Map<Thread, Boolean> timeoutOverwrites = new ConcurrentHashMap<>();
+
public void throwTimeExceededException() {
- throw new TimeExceededException();
+ if (timeoutOverwrites.getOrDefault(Thread.currentThread(), false) == false) {
+ throw new TimeExceededException();
+ }
}
private static class TimeExceededException extends RuntimeException {
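To make the idea in the patch easier to follow in isolation, here is a minimal, self-contained sketch of the per-thread override. It is not the actual ContextIndexSearcher code: the class name `TimeoutOverrideSketch` and the method `runPostCollection` are hypothetical, and only the `timeoutOverwrites` map and the guarded `throwTimeExceededException()` mirror the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for ContextIndexSearcher; only the override logic is modelled.
class TimeoutOverrideSketch {

    static class TimeExceededException extends RuntimeException {}

    // Threads listed here are currently running post-collection and must not see timeouts.
    private final Map<Thread, Boolean> timeoutOverwrites = new ConcurrentHashMap<>();

    // Mirrors the patched method: throws unless the calling thread has an active override.
    void throwTimeExceededException() {
        if (timeoutOverwrites.getOrDefault(Thread.currentThread(), false) == false) {
            throw new TimeExceededException();
        }
    }

    // Hypothetical wrapper: suppresses timeouts for the calling thread only,
    // and always clears the override, even if post-collection fails.
    void runPostCollection(Runnable postCollection) {
        timeoutOverwrites.put(Thread.currentThread(), true);
        try {
            postCollection.run();
        } finally {
            timeoutOverwrites.remove(Thread.currentThread());
        }
    }
}
```

Because the override is keyed by thread, other search worker threads that are still collecting continue to throw on timeout or cancellation.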
LGTM
This change is a bit hard for me to review because I've been away from this code for some time. That said, given that the approach for aggregations is to treat each slice of segments as a mini-shard, it makes sense to me to run
That's right.
@@ -498,7 +516,9 @@ public boolean timeExceeded() {
 }

 public void throwTimeExceededException() {
- throw new TimeExceededException();
+ if (timeoutOverwrites.getOrDefault(Thread.currentThread(), false) == false) {
Looking at the changes above, I am wondering if there are situations where post collection does want a timeout to be thrown. Are there? If not, is there a way to disable timeouts in post collection directly? I worry that this type of change will make it harder to migrate to Lucene's timeout support.
No, the current logic expects no timeouts during the post-collection phase.
> is there a way to disable timeouts in post collection directly?
Not as far as I know. The main issue is the deferrable aggregations, which run during that phase and actually access the directory, which can throw timeouts.
Could we not do something similar to what we were doing before? I mean, what is the point of having timeouts if we don't throw an exception when there is one? Should we rather remove the timeout runnable at this point, before post collection? Or are you worried that we may not honour cancellation if we do so?
As you said, we cannot remove the timeout as it affects all running threads. We still want other threads to honour cancellation.
Note that before, we also just built the top-level internal aggregations when a timeout occurred (in AggregationPhase). This workaround allows us to do the same; otherwise we can't return a partial aggregation response (we just fail to produce the search response).
Ok got it, thanks for explaining.
@ellasticmachine run elasticsearch-ci/part-1
LGTM
This PR introduces an AggregatorCollector that contains a finish method, which performs aggregation post-collection and builds the internal aggregation for this collector. This method is called on the worker thread at the end of the collection phase.
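As a rough illustration of the finish-on-worker-thread idea, here is a simplified sketch. It does not use the real Elasticsearch or Lucene types: `SliceCollectorSketch`, `collect`, `result` and `search` are hypothetical names, a plain sum stands in for the aggregation, and each list of values stands in for a slice of segments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, simplified stand-in for a per-slice aggregation collector.
class SliceCollectorSketch {
    private long sum;     // collection state, only touched by one worker thread
    private long result;  // "internal aggregation" built by finish()

    void collect(long value) {
        sum += value;
    }

    // Post-collection and result building, run on the same thread that collected.
    void finish() {
        result = sum;
    }

    long result() {
        return result;
    }

    // Each slice is collected and finished inside its worker task;
    // the calling thread only reduces the already-built partial results.
    static long search(List<List<Long>> slices) {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            List<Future<SliceCollectorSketch>> futures = new ArrayList<>();
            for (List<Long> slice : slices) {
                futures.add(workers.submit(() -> {
                    SliceCollectorSketch collector = new SliceCollectorSketch();
                    for (long value : slice) {
                        collector.collect(value);
                    }
                    collector.finish(); // runs on the worker thread
                    return collector;
                }));
            }
            long total = 0;
            for (Future<SliceCollectorSketch> future : futures) {
                total += future.get().result();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }
}
```

The point of the sketch is only the placement of `finish()`: it is called inside the worker task, so the coordinating thread never touches per-slice collection state.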
The PR is set as a draft because it found an issue with global ordinals. In this case you get errors looking like:
The issue is that global ordinals are created on the first collector but then reused by the other collectors during the post-collection / internal aggregation building phase.
In order to get around the issue we disable the asserting codec.
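The kind of failure described above can be sketched as follows, assuming the errors come from a check that an object is only used by the thread that created it. That assumption is not confirmed by the PR text, and `ThreadConfinedOrdinalsSketch`, `lookup` and `failsOnOtherThread` are hypothetical names, not Lucene or Elasticsearch APIs.

```java
// Hypothetical model of a thread-confined resource with an ownership check.
class ThreadConfinedOrdinalsSketch {
    // Remember which thread created this instance (the "first collector" thread).
    private final Thread creationThread = Thread.currentThread();

    // Fails when called from any thread other than the creating one.
    long lookup(long ord) {
        if (Thread.currentThread() != creationThread) {
            throw new IllegalStateException("used by a thread other than the creating one");
        }
        return ord;
    }

    // Creates the resource on the calling thread, then reuses it from another thread,
    // which is the reuse pattern described above.
    static boolean failsOnOtherThread() {
        ThreadConfinedOrdinalsSketch ordinals = new ThreadConfinedOrdinalsSketch();
        boolean[] failed = {false};
        Thread other = new Thread(() -> {
            try {
                ordinals.lookup(0);
            } catch (IllegalStateException e) {
                failed[0] = true;
            }
        });
        other.start();
        try {
            other.join(); // join makes the write to failed[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failed[0];
    }
}
```

Under this assumption, disabling the check (as the PR does with the asserting codec) hides the error without changing which thread actually touches the shared ordinals.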
closes #98705