-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Planning time limit #7213
Planning time limit #7213
Conversation
core/trino-main/src/main/java/io/trino/execution/QueryManagerConfig.java
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/DataDefinitionExecution.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/execution/resourcegroups/db/TestQueuesDb.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/execution/resourcegroups/db/TestQueuesDb.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/QueryManagerConfig.java
Show resolved
Hide resolved
6a60698
to
2582129
Compare
Added a test that freezes during planning. To my amazement it got interrupted properly. Changed docs so that setting this limit will not guarantee instant termination of the query |
do you plan to make it? |
Sure, it is easy. Just need to open a PR and propose an absurd default value. Then a lot of of Trino seniors will come arguing about the actual value and settle with something. |
core/trino-main/src/main/java/io/trino/execution/QueryManagerConfig.java
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/QueryState.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/DataDefinitionExecution.java
Outdated
Show resolved
Hide resolved
import static java.lang.Integer.MAX_VALUE; | ||
import static org.assertj.core.api.Assertions.assertThatThrownBy; | ||
|
||
// Tests relies on timeouts so additional query runners may disturb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what did you mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test needs to finish between 5s and 10s time window. Any task running in the background may increase chance of flakiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is not complete
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Show resolved
Hide resolved
2582129
to
4b9c92d
Compare
core/trino-main/src/main/java/io/trino/execution/QueryStateMachine.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/QueryStateMachine.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/QueryStateMachine.java
Outdated
Show resolved
Hide resolved
4b9c92d
to
029c936
Compare
@@ -13,6 +13,18 @@ The maximum allowed time for a query to be actively executing on the | |||
cluster, before it is terminated. Compared to the run time below, execution | |||
time does not include analysis, query planning or wait times in a queue. | |||
|
|||
``query.max-planning-time`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please review @mosabua
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
import static java.lang.Integer.MAX_VALUE; | ||
import static org.assertj.core.api.Assertions.assertThatThrownBy; | ||
|
||
// Tests relies on timeouts so additional query runners may disturb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is not complete
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Outdated
Show resolved
Hide resolved
Thread planningThread = currentThread(); | ||
stateMachine.getStateChange(PLANNING).addListener(() -> { | ||
if (stateMachine.getQueryState() == FAILED) { | ||
planningThread.interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be a planning thread anymore (when planning is finished) when query goes to FAILED
state. In fact, this executor thread might be running planning for some other query. I think interruption needs to run in critical section:
- planning thread is setting
planningThread=null
when planning is finished - other thread is calling
interrupt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The listener will fire only when the state changes from PLANNING
to FAILED
. Once planning is complete the state changes immediately to STARTING
and the thread will not get interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, you are right, the thread might be reused. I will sync it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. PTAL
4bb9a59
to
b6069c5
Compare
b6069c5
to
c665258
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm % comments
@@ -284,6 +285,20 @@ public QueryManagerConfig setQueryMaxExecutionTime(Duration queryMaxExecutionTim | |||
return this; | |||
} | |||
|
|||
@NotNull | |||
@MinDuration("1ns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not 0? Duration itself can be 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the constraint to be consistent with max-execution-time
{ | ||
assertThatThrownBy(() -> queryRunner.execute("SELECT * FROM t1 WHERE col = 'abc'")) | ||
.hasMessageContaining("Query exceeded the maximum planning time limit of 5.00s"); | ||
assertThat(interrupted).isTrue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is racy, planning thread might have hiccup and set interrupted = true
later than queryRunner.execute
returns.
Use https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CountDownLatch.html instead.
core/trino-main/src/main/java/io/trino/execution/SqlQueryExecution.java
Outdated
Show resolved
Hide resolved
@@ -129,6 +134,9 @@ | |||
private final CostCalculator costCalculator; | |||
private final DynamicFilterService dynamicFilterService; | |||
|
|||
@Nullable | |||
private Thread planningThread; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a field in the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in async lambdas and cannot be a variable. So I could either decorate it in some AtomicReference
type of object of make it a field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to be a field, then. You just need to do:
final Thread planningThread = currentThread();
stateMachine.getStateChange(PLANNING).addListener(() -> {
if (stateMachine.getQueryState() == FAILED) {
synchronized (this) {
planningThread.interrupt();
}
}
}, directExecutor());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is being set to null
later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why we need that. The only reason would be to prevent the wrong thread from being interrupted, but that can't really happen. If the query transitions to FAILED during planning, it must still be executing the start()
method, since all the planning methods are synchronous. It is only then that the thread is interrupted, which means it will be the right thread.
I can't think of a combination of actions that would result in start()
completing and the query not already be in a state other than PLANNING, which means the interruption wouldn't happen beyond that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was implemented to prevent the interruption in an unlikely case of the thread being already reused for another query. This can be simulated by placing long sleep just before the interrupt()
method. That way we can interrupt the next query executed in this thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State change listener can execute any time in the future (even when planningThread
started serving another query)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. The use of directExecutor()
in the addListener call threw me off.
However, if can happen only if the planning fails from the call to fail()
in the catch at the bottom of this method.
Still, I would write this code as follows to keep the state and logic self-contained and not have to reason about a class-level field, it's lifetime beyond this method (or other methods, etc). Ideally, we should be able to remove listeners, but there's no facility to do so:
final AtomicReference<Thread> planningThread = new AtomicReference<>(currentThread());
stateMachine.getStateChange(PLANNING).addListener(() -> {
if (stateMachine.getQueryState() == FAILED) {
synchronized (this) {
Thread thread = planningThread.get();
if (thread != null) {
thread.interrupt();
}
}
}
}, directExecutor());
try {
PlanRoot plan = planQuery();
// DynamicFilterService needs plan for query to be registered.
// Query should be registered before dynamic filter suppliers are requested in distribution planning.
registerDynamicFilteringQuery(plan);
planDistribution(plan);
}
finally {
synchronized (this) {
planningThread.set(null);
// clear the interrupted flag in case there was a race condition where
// the planning thread was interrupted right after planning completes above
Thread.interrupted();
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, if can happen only if the planning fails from the call to fail() in the catch at the bottom of this method.
Why only then? Query might be interrupted via timeout and thread.interrupt();
might be delivered in arbitrary time after that.
Other than that keeping logic self-contained within method makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only then? Query might be interrupted via timeout and thread.interrupt(); might be delivered in arbitrary time after that.
Ah yes, scratch that. I was thinking of the case where the interrupt causes those methods to throw.
core/trino-main/src/main/java/io/trino/execution/SqlQueryExecution.java
Outdated
Show resolved
Hide resolved
if (stateMachine.getQueryState() == FAILED) { | ||
synchronized (this) { | ||
if (planningThread != null) { | ||
planningThread.interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this relate to the interrupt delivered here
queryExecutionFuture.cancel(true); |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a different thread. queryExecutionFuture.cancel(true)
interrupts thread started in LocalDispatchQueryFactory::createDispatchQuery
.
The actual planning thread is started from that thread in LocalDispatchQuery::startExecution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you push you changes?
c665258
to
053066b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just skimming)
core/trino-main/src/main/java/io/trino/execution/SqlQueryExecution.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/QueryState.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlQueryExecution.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlQueryExecution.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Outdated
Show resolved
Hide resolved
053066b
to
f45d628
Compare
@martint PTAL if the current state is ok for you |
@@ -129,6 +134,9 @@ | |||
private final CostCalculator costCalculator; | |||
private final DynamicFilterService dynamicFilterService; | |||
|
|||
@Nullable | |||
private Thread planningThread; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to be a field, then. You just need to do:
final Thread planningThread = currentThread();
stateMachine.getStateChange(PLANNING).addListener(() -> {
if (stateMachine.getQueryState() == FAILED) {
synchronized (this) {
planningThread.interrupt();
}
}
}, directExecutor());
core/trino-main/src/main/java/io/trino/execution/QueryTracker.java
Outdated
Show resolved
Hide resolved
f45d628
to
018e9ea
Compare
@martint I used the |
Session defaultSession = testSessionBuilder() | ||
.setCatalog("mock") | ||
.setSchema("default") | ||
.setSystemProperty(QUERY_MAX_PLANNING_TIME, "5s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make this much tighter for this test or come up with a better way to test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly do you mean?
The feature is about making best effort to stop the planning, so ensuring it works in some cases should be enough IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm saying that 5s is to long to sleep in a test.
testing/trino-tests/src/test/java/io/trino/execution/TestQueryTracker.java
Outdated
Show resolved
Hide resolved
PlanRoot plan = planQuery(); | ||
// DynamicFilterService needs plan for query to be registered. | ||
// Query should be registered before dynamic filter suppliers are requested in distribution planning. | ||
registerDynamicFilteringQuery(plan); | ||
planDistribution(plan); | ||
|
||
synchronized (this) { | ||
if (Thread.interrupted()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary. It should be taken care of by the thread pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This thread does not go into pool immediately. This might not be a problem, but it seems cleaner to clear interrupted flag since we interrupt planning specifically. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If planning is interrupted before this line, the above calls (planQuery, registerDynamicFilteringQuery, planDistribution) will throw and this will not be reached.
The only case where it can happen is if there is a race condition where the planning completes just about the time the timeout kicks in and interrupts the thread right after planDistribution
returns or while it's in some part that doesn't get affected by the interrupt so it doesn't throw. It definitely deserves a comment, and the debug statement is not very useful. Also see my comment above on how I'd restructure this section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If planning is interrupted before this line, the above calls (planQuery, registerDynamicFilteringQuery, planDistribution) will throw and this will not be reached.
That will only happen if these methods will throw interrupted exception on IO which might not be the case.
Also see my comment above on how I'd restructure this section.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will only happen if these methods will throw interrupted exception on IO which might not be the case.
That should not happen unless the code can do something reasonable with a lack of result from that IO. If we're silently ignoring those interruptions and making results up, that's a bigger problem.
@@ -129,6 +134,9 @@ | |||
private final CostCalculator costCalculator; | |||
private final DynamicFilterService dynamicFilterService; | |||
|
|||
@Nullable | |||
private Thread planningThread; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why we need that. The only reason would be to prevent the wrong thread from being interrupted, but that can't really happen. If the query transitions to FAILED during planning, it must still be executing the start()
method, since all the planning methods are synchronous. It is only then that the thread is interrupted, which means it will be the right thread.
I can't think of a combination of actions that would result in start()
completing and the query not already be in a state other than PLANNING, which means the interruption wouldn't happen beyond that point.
018e9ea
to
a153e60
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but let's change the default max planning time to something more reasonable (e.g., 10 minutes). 100 days is meaningless -- we might as well make it optional and default to no limit specified.
c3d8789
to
2c495b7
Compare
mind automation |
There are some problems with GHA at the moment, I will restart it later |
When the query state changes from PLANNING to FAILED, due to error or explicit cancel, the planning thread will get interrupted possibly freeing some of the resources.
2c495b7
to
8bbdb6c
Compare
No description provided.