-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing. #12392
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
|
core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few editorials, but major concern is there is no synchronization between code which interrupts Split runner threads and the threads themselves. It very much looks like we may interrupt thread which already moved to another split.
@@ -69,6 +69,7 @@ | |||
|
|||
private Duration statusRefreshMaxWait = new Duration(1, TimeUnit.SECONDS); | |||
private Duration infoUpdateInterval = new Duration(3, TimeUnit.SECONDS); | |||
private Duration interruptRunawaySplitsTimeout = new Duration(600, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have a warning about the runaway splits, currently hardcoded. IMO we should validate the two thresholds are coherent (i.e. warning <= timeout), which would require making the warning configurable too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, will configure: "LONG_SPLIT_WARNING_THRESHOLD"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree we should make warning <= timeout, because if the thread runtime for the split > than timeout, the system had already interrupted the split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also felt "/v1/maxActiveSplits" is unnecessary.. Since this PR will print out the run away splits in the log..
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
MultilevelSplitQueue splitQueue, | ||
Ticker ticker) | ||
{ | ||
checkArgument(runnerThreads > 0, "runnerThreads must be at least 1"); | ||
checkArgument(guaranteedNumberOfDriversPerTask > 0, "guaranteedNumberOfDriversPerTask must be at least 1"); | ||
checkArgument(maximumNumberOfDriversPerTask > 0, "maximumNumberOfDriversPerTask must be at least 1"); | ||
checkArgument(guaranteedNumberOfDriversPerTask <= maximumNumberOfDriversPerTask, "guaranteedNumberOfDriversPerTask cannot be greater than maximumNumberOfDriversPerTask"); | ||
checkArgument(interruptRunawaySplitsTimeout.getValue(TimeUnit.SECONDS) >= 1.0, "interruptRunawaySplitsTimeout must be at least 1 second"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 1s
based off of SPLIT_RUN_QUANTA? if so let's just use that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/execution/executor/TestTaskExecutor.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
Hey, I found out the issue is very similar to #7213, in which we want to limit the total time for SQLQueryExecution. In that case, the thread is reused between queries, in our case, the thread is reused between tasks. In either case, I abstract the following programming pattern:
From my personally understanding, the AtomicReference can also be replaced by volatile keyword. |
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
|
14a554d
to
8c35f16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs LGTM, just need to word wrap at 80 characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, mostly nits
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/executor/PrioritizedSplitRunner.java
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/execution/TestSqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me % nits.
Please update the commit message according to the guideline: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/SqlTaskManager.java
Outdated
Show resolved
Hide resolved
implements Comparable<RunningSplitInfo> | ||
{ | ||
private final long startTime; | ||
private final String threadId; | ||
private final Thread thread; | ||
private boolean printed; | ||
private final PrioritizedSplitRunner split; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Store the task id and split info directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please respond (let me know if you missed it or not going to address)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. The split info cannot be cached as it changes as the split execution progresses.
adb169e
to
93e5f09
Compare
Hi, I have addressed all the comments. I also changed the minimum value for the configs. Due to the async nature, the check based on the walltime is inaccurate when the timeout is at a similar scale to the time quota. It can only detect splits that significantly run longer than the split's 1 second time quota |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % comments
implements Comparable<RunningSplitInfo> | ||
{ | ||
private final long startTime; | ||
private final String threadId; | ||
private final Thread thread; | ||
private boolean printed; | ||
private final PrioritizedSplitRunner split; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please respond (let me know if you missed it or not going to address)
c8d31e7
to
89d7dee
Compare
Merged, thanks! |
Description
Enable Trino to detect and fail tasks that are stuck in long running JONI parsing.
Fix
Query Engine
Trino is a time shared multi-tenant system. For context switching, trino relies on the threadpool workers cooperatively yield itself to the scheduling logic. It is not like an operating system which used a hard interrupt signal to force the process to do the context switch.
Most of the split processing can be finished in a relative short interval, whereas the JONI processing is an exception.
Furthermore, some of the trino features: gathering statistics, relaying on the trino to do callback function(to execute splitFinished() after yield) after the context switch, since there is no context switch in this case, trino can't gather accurate CPU usage for that split.
This PR allows trino to detect and fail tasks that are stuck in long running JONI parsing.
Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:
Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing.