Add query retry for transient errors #15857

highker · 2021-03-22T07:32:19Z

depends on https://github.com/facebookexternal/presto-facebook/pull/1503

== RELEASE NOTES ==

General Changes
*  Add automatic query retry for transient failures. Enable it by setting ``per-query-retry-limit`` to a non-zero integer, which is the number of retries allowed per query.

tdcmeehan · 2021-03-26T14:31:12Z

Quick initial comment: I think we'll want a new counter in QueryManagerStats (we'll want to monitor the number of retries--in production support, retries will be a leading indicator that something's going wrong).
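A minimal sketch of the kind of counter suggested here. The class and method names (`RetryStatsSketch`, `recordRetry`, `getRetriedQueries`) are hypothetical and are not taken from QueryManagerStats; the sketch only uses `java.util.concurrent.atomic.LongAdder` to show a thread-safe counter that monitoring could poll.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a retry counter; not the actual QueryManagerStats code.
final class RetryStatsSketch
{
    // LongAdder keeps increments cheap when many threads record retries at once.
    private final LongAdder retriedQueries = new LongAdder();

    // Call once each time the server creates a retry query.
    void recordRetry()
    {
        retriedQueries.increment();
    }

    // Total retries recorded so far; monitoring could poll this and alert on spikes.
    long getRetriedQueries()
    {
        return retriedQueries.sum();
    }

    public static void main(String[] args)
    {
        RetryStatsSketch stats = new RetryStatsSketch();
        stats.recordRetry();
        stats.recordRetry();
        System.out.println("retried queries: " + stats.getRetriedQueries());
    }
}
```

In a real deployment the counter would presumably be exported through whatever metrics mechanism the stats class already uses; that wiring is left out here.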

tdcmeehan

One problem we'll have is logging.

Right now, people probably don't dedupe their query logs. If they see one failure for one query id in the logs, then it's counted as a failure.

We'll either need to enforce this semantic difference everywhere (AND add the retry information in the logs--it's currently missing), or we'll need to figure out a way to defer the query completed event until after we've confirmed it's not eligible for retry--perhaps once it's expired and cleaned up altogether.

presto-main/src/main/java/com/facebook/presto/server/protocol/Query.java

presto-main/src/main/java/com/facebook/presto/server/protocol/QueuedStatementResource.java

tdcmeehan · 2021-03-26T14:54:10Z

(For posterity, I know this is being worked on.) We'll also want to consider the queueing behavior for retried queries.

tdcmeehan

Bring retry queries to the beginning of the queue

...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java
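As an illustration of the queueing behavior named in this commit title, here is a simplified sketch in which retried queries jump to the head of a FIFO queue. `RetryAwareQueueSketch` and its methods are hypothetical; the real InternalResourceGroup has its own queue types and scheduling policies, which this does not model.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified queue; not the InternalResourceGroup implementation.
final class RetryAwareQueueSketch
{
    private final Deque<String> queued = new ArrayDeque<>();

    // Retried queries go to the head so they do not wait behind newer submissions;
    // ordinary queries keep FIFO order at the tail.
    synchronized void enqueue(String queryId, boolean isRetry)
    {
        if (isRetry) {
            queued.addFirst(queryId);
        }
        else {
            queued.addLast(queryId);
        }
    }

    // Returns the next query to start, or null if the queue is empty.
    synchronized String poll()
    {
        return queued.pollFirst();
    }

    public static void main(String[] args)
    {
        RetryAwareQueueSketch queue = new RetryAwareQueueSketch();
        queue.enqueue("A", false);
        queue.enqueue("B", false);
        queue.enqueue("A-retry", true);
        System.out.println(queue.poll());
    }
}
```

The rationale is that a retried query has already waited in the queue once, so sending it to the back would penalize it twice for a transient failure.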

tdcmeehan · 2021-03-29T00:52:25Z

presto-main/src/main/java/com/facebook/presto/server/protocol/QueuedStatementResource.java

+        Query existingRetryQuery = retriedQueries.putIfAbsent(queryId, query);
+        if (existingRetryQuery != null) {
+            // another thread has already created the new retry query;
+            // remove the newly registered query and reuse the existing one
+            queries.remove(query.getQueryId());
+            query = existingRetryQuery;
+        }


If there are two or more server-side retries (meaning the server has asked the client to send a retry), how will this look different from client-side retries (meaning the client retries this GET)?

I'm also not sure of the intent of removing the query from the queries map--I believe this will cause a leak.

My guess at what should happen is this: we can include the retryCount in the path. We can remove the retriedQueries map and just compute the new value for the retry. If, inside the callback for compute, the old, existing query has a retryCount that is greater than or equal to the retryCount in the path, then do nothing; otherwise return the new query (which will cause the queries map to be updated).
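The compute-based proposal above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: `RetryCountComputeSketch`, `QuerySketch`, `register`, and `retry` are hypothetical names, and the retry count is passed as a plain argument rather than parsed from a request path.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the "retryCount in the path" idea.
final class RetryCountComputeSketch
{
    // Minimal stand-in for a query; only the fields needed for the comparison.
    static final class QuerySketch
    {
        final String queryId;
        final int retryCount;

        QuerySketch(String queryId, int retryCount)
        {
            this.queryId = queryId;
            this.retryCount = retryCount;
        }
    }

    private final ConcurrentMap<String, QuerySketch> queries = new ConcurrentHashMap<>();

    // Registers the original attempt with retryCount 0.
    void register(String queryId)
    {
        queries.put(queryId, new QuerySketch(queryId, 0));
    }

    // compute() runs atomically per key, so duplicate or stale retry requests
    // see the already-created query instead of creating another one.
    QuerySketch retry(String queryId, int requestedRetryCount)
    {
        return queries.compute(queryId, (id, existing) -> {
            if (existing != null && existing.retryCount >= requestedRetryCount) {
                return existing; // do nothing: this retry was already handled
            }
            return new QuerySketch(id, requestedRetryCount); // updates the queries map
        });
    }

    public static void main(String[] args)
    {
        RetryCountComputeSketch sketch = new RetryCountComputeSketch();
        sketch.register("Q1");
        QuerySketch first = sketch.retry("Q1", 1);
        QuerySketch duplicate = sketch.retry("Q1", 1);
        System.out.println(first == duplicate);
    }
}
```

With this shape a client-side retry of the same GET carries the same retry count and becomes a no-op, while a genuine server-side retry carries a higher count and replaces the entry.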

highker

I double-checked the logic, and it's actually more complicated. For any query, the "retry count" will always be one: suppose Q1 fails; then we send GET to /retry/Q1, which generates Q2. If Q2 then fails, we send GET to /retry/Q2, which generates Q3, and so on. I chose this chained sequence so that we have the chance to create Q_{n+1} while Q_{n} has not yet been purged. If we go with Q1 plus a retry count, we will need another complicated bookkeeping data structure to make sure Q1 is not purged before we exhaust all chances. Note that telling whether we have exhausted all chances is already fairly complicated in Query::getNextResultWithRetry.

So I made a slightly different change: a synchronized method that still keeps the retriedQueries map. Let me know if that looks ok.
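A rough sketch of the chained scheme described in this reply, where each failed query maps to at most one successor and a synchronized method makes duplicate retry requests idempotent. Everything here is hypothetical (`RetryChainSketch`, `getOrCreateRetry`, the `Q<n>` ids); retry limits and purging of old queries are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Q1 -> Q2 -> Q3 retry chain; not the PR's code.
final class RetryChainSketch
{
    // Maps a failed query id to the id of the single retry query created for it.
    private final Map<String, String> retriedQueries = new HashMap<>();
    private int nextQueryNumber = 2; // Q1 is assumed to be the original query

    // synchronized so that two concurrent GETs for the same failed query
    // cannot both create a successor; the second caller reuses the first result.
    synchronized String getOrCreateRetry(String failedQueryId)
    {
        String existing = retriedQueries.get(failedQueryId);
        if (existing != null) {
            return existing;
        }
        String retryQueryId = "Q" + nextQueryNumber++;
        retriedQueries.put(failedQueryId, retryQueryId);
        return retryQueryId;
    }

    public static void main(String[] args)
    {
        RetryChainSketch chain = new RetryChainSketch();
        System.out.println(chain.getOrCreateRetry("Q1"));
        System.out.println(chain.getOrCreateRetry("Q2"));
    }
}
```

Because each link only refers to its immediate predecessor, Q_{n} only needs to survive until Q_{n+1} has been created, which is the property the reply relies on.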

highker · 2021-04-21T07:26:46Z

We have some issue with the test infra related to actions/runner-images#841

...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java

The patch allows queries to retry with circuit-breaker logic. Only communication failures are eligible for retry.

highker requested a review from tdcmeehan March 22, 2021 07:32

highker force-pushed the retry branch 2 times, most recently from 3ca49fe to ea9dae9 Compare March 22, 2021 07:36

highker changed the title ~~[RFC] Add query retry for transient errors~~ Add query retry for transient errors Mar 26, 2021

highker force-pushed the retry branch 2 times, most recently from e0e36dd to cc51c2b Compare March 26, 2021 07:43

tdcmeehan reviewed Mar 26, 2021

presto-main/src/main/java/com/facebook/presto/server/protocol/Query.java (outdated, resolved)

presto-main/src/main/java/com/facebook/presto/server/protocol/Query.java (resolved)

tdcmeehan reviewed Mar 26, 2021

presto-main/src/main/java/com/facebook/presto/server/protocol/QueuedStatementResource.java (resolved)

highker force-pushed the retry branch from cc51c2b to d802934 Compare March 28, 2021 09:09

tdcmeehan reviewed Mar 28, 2021

tdcmeehan reviewed Mar 29, 2021

highker force-pushed the retry branch 6 times, most recently from 865b84f to 0f70dc1 Compare April 21, 2021 07:10

highker requested a review from tdcmeehan April 21, 2021 07:26

highker force-pushed the retry branch 2 times, most recently from c2350ea to 171c000 Compare April 22, 2021 04:28

tdcmeehan reviewed Apr 23, 2021

...o-main/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java (outdated, resolved)

tdcmeehan approved these changes Apr 23, 2021

highker added 5 commits April 23, 2021 00:03

Remove unused timeoutExecutor for queued queries

2714753

Add retriable flag to error code

1e6c1c9

Add query retry logic for transient failures

e517c49

The patch allows queries to retry with circuit-breaker logic. Only communication failures are eligible for retry.

Add query retry local limit enforcement

688d626

Throw error upon purged retryable queries

26655f4

highker added 4 commits April 23, 2021 00:06

Support query retry only for auto commit transactions

92b868a

Add tests for query retry

af60b04

Handle duplicated retry requests

473dd6d

Bring retry queries to the beginning of the queue

4041176

highker force-pushed the retry branch from 171c000 to 4041176 Compare April 23, 2021 07:15

highker merged commit a34cf43 into prestodb:master Apr 23, 2021

vaishnavibatni mentioned this pull request Apr 27, 2021

Add release notes for 0.252 #16013

Merged

3 tasks

tooptoop4 mentioned this pull request Jun 19, 2021

Query resilience to failures trinodb/trino#2909

Closed

highker mentioned this pull request Mar 27, 2022

Retry broadcast OOM with BHJ disabled within the same spark session #17528

Merged
