-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add query retry for transient errors #15857
Conversation
3ca49fe
to
ea9dae9
Compare
e0e36dd
to
cc51c2b
Compare
Quick initial comment: I think we'll want a new counter in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem we'll have is logging.
Right now, people probably don't dedupe their query logs. If they see one failure for one query id in the logs, then it's counted as failure.
We'll either need to enforce this semantic difference everywhere (AND add the retry information in the logs--it's currently missing), or we'll need to figure out a way to defer the query completed event until after we've confirmed it's not eligible for retry--perhaps once it's expired and cleaned up altogether.
presto-main/src/main/java/com/facebook/presto/server/protocol/Query.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/server/protocol/Query.java
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/server/protocol/QueuedStatementResource.java
Show resolved
Hide resolved
(For posterity, I know this is being worked on.). We'll also want to consider the queueing behavior for retried queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bring retry queries to the beginning of the queue
...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java
Outdated
Show resolved
Hide resolved
Query existingRetryQuery = retriedQueries.putIfAbsent(queryId, query); | ||
if (existingRetryQuery != null) { | ||
// other thread has already create the new retry query | ||
// remove the registered new query to reuse the existing one | ||
queries.remove(query.getQueryId()); | ||
query = existingRetryQuery; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's two or more server-side retries (meaning, sever has requested client to send a retry), how will this look different from client-side retries (meaning, the client retries this GET)?
I also am not sure of the intent of removing the query from the queries
map--I believe this will cause a leak.
My guess what should happen is this: we can include the retryCount in the path. We can remove the retriedQueries
map and just compute
the new value for the retry. If, inside the callback for compute
, the old, existing query has a retryCount that is equal or greater to the retryCount in the path, then do nothing, otherwise return the new query (which will cause the queries
map to be updated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked the logic. Actually it's more complicated. For any query, the "retry count" will always be one: suppose Q1 failed, then we will send GET to /retry/Q1 and generates Q2. If further Q2 fails, we will send GET to /retry/Q2 and generates Q3, and so on. I started this sequence so we can have the chance to create Q_{n+1} when Q_{n} has not yet been purged. If we go with Q1 with retry count, then we will need other complicated bookkeeping data structure to make sure Q1 is not purged before we exhaust all chances. Note that to difference if we have exhausted all chances is fairly complicated in Query::getNextResultWithRetry
.
So I made a bit different change with synchronized method still keeping the retryQueries. Let me know if that looks ok.
865b84f
to
0f70dc1
Compare
We have some issue with the test infra related to actions/runner-images#841 |
c2350ea
to
171c000
Compare
...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java
Outdated
Show resolved
Hide resolved
The patch allows queries to retry with a circuit breaker logic. Only communication failures are allowed with retries.
depends on https://github.com/facebookexternal/presto-facebook/pull/1503