Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query retry for transient errors #15857

Merged
merged 9 commits into from
Apr 23, 2021
Merged

Add query retry for transient errors #15857

merged 9 commits into from
Apr 23, 2021

Conversation

highker
Copy link
Contributor

@highker highker commented Mar 22, 2021

depends on https://github.com/facebookexternal/presto-facebook/pull/1503

== RELEASE NOTES ==

General Changes
*  Add automatic query retry functionality for transient failures. This can be enabled by setting ``per-query-retry-limit`` to a non-zero integer to indicate the per query retry count.

@highker highker requested a review from tdcmeehan March 22, 2021 07:32
@highker highker force-pushed the retry branch 2 times, most recently from 3ca49fe to ea9dae9 Compare March 22, 2021 07:36
@highker highker changed the title [RFC] Add query retry for transient errors Add query retry for transient errors Mar 26, 2021
@highker highker force-pushed the retry branch 2 times, most recently from e0e36dd to cc51c2b Compare March 26, 2021 07:43
@tdcmeehan
Copy link
Contributor

Quick initial comment: I think we'll want a new counter in QueryManagerStats (we'll want to monitor the number of retries--the retries will be a leading indicator in production support that indicates something's going wrong).

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem we'll have is logging.

Right now, people probably don't dedupe their query logs. If they see one failure for one query id in the logs, then it's counted as failure.

We'll either need to enforce this semantic difference everywhere (AND add the retry information in the logs--it's currently missing), or we'll need to figure out a way to defer the query completed event until after we've confirmed it's not eligible for retry--perhaps once it's expired and cleaned up altogether.

@tdcmeehan
Copy link
Contributor

(For posterity, I know this is being worked on.). We'll also want to consider the queueing behavior for retried queries.

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bring retry queries to the beginning of the queue

Comment on lines 224 to 230
Query existingRetryQuery = retriedQueries.putIfAbsent(queryId, query);
if (existingRetryQuery != null) {
// other thread has already create the new retry query
// remove the registered new query to reuse the existing one
queries.remove(query.getQueryId());
query = existingRetryQuery;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's two or more server-side retries (meaning, sever has requested client to send a retry), how will this look different from client-side retries (meaning, the client retries this GET)?

I also am not sure of the intent of removing the query from the queries map--I believe this will cause a leak.

My guess what should happen is this: we can include the retryCount in the path. We can remove the retriedQueries map and just compute the new value for the retry. If, inside the callback for compute, the old, existing query has a retryCount that is equal or greater to the retryCount in the path, then do nothing, otherwise return the new query (which will cause the queries map to be updated).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked the logic. Actually it's more complicated. For any query, the "retry count" will always be one: suppose Q1 failed, then we will send GET to /retry/Q1 and generates Q2. If further Q2 fails, we will send GET to /retry/Q2 and generates Q3, and so on. I started this sequence so we can have the chance to create Q_{n+1} when Q_{n} has not yet been purged. If we go with Q1 with retry count, then we will need other complicated bookkeeping data structure to make sure Q1 is not purged before we exhaust all chances. Note that to difference if we have exhausted all chances is fairly complicated in Query::getNextResultWithRetry.

So I made a bit different change with synchronized method still keeping the retryQueries. Let me know if that looks ok.

@highker highker force-pushed the retry branch 6 times, most recently from 865b84f to 0f70dc1 Compare April 21, 2021 07:10
@highker highker requested a review from tdcmeehan April 21, 2021 07:26
@highker
Copy link
Contributor Author

highker commented Apr 21, 2021

We have some issue with the test infra related to actions/runner-images#841

@highker highker force-pushed the retry branch 2 times, most recently from c2350ea to 171c000 Compare April 22, 2021 04:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants