Tasks: Retry if task can't be written #35054
Conversation
Adds about a minute's worth of backoffs and retries to saving task results so it is *much* more likely that a busy cluster won't lose task results. This isn't an ideal solution to losing task results, but it is an incremental improvement. If all of the retries fail we still log the task result, but that is far from ideal. Closes elastic#33764
Pinging @elastic/es-distributed
Oh, me.
@imotov what do you think of this?
@@ -159,7 +178,13 @@ public void onResponse(IndexResponse indexResponse) {

     @Override
     public void onFailure(Exception e) {
-        listener.onFailure(e);
+        if (backoff.hasNext()) {
Should we be a bit more selective about when we retry here? Like if the failure is TOO_MANY_REQUESTS we should definitely retry, but if it is a mapping failure, it's unlikely somebody will fix it within a minute, so we definitely shouldn't. And, to be honest, I cannot think of any other failure besides EsRejectedExecutionException where it would definitely make sense to retry. If the index is temporarily unavailable and we are here, that means that we already retried, right? If settings or mappings are screwed up or the shard is freaking out for any other reason, we probably should just give up as well. What do you think?
> If the index is temporarily unavailable and we are here, that means that we already retried, right?
Like we try to write to the index and the shard isn't allocated, say because both of the nodes that had a copy of it went down. That seems fairly unlikely, but possible. I wonder if I should just do rejected execution exception for now and look at that case in a follow-up.
> I wonder if I should just do rejected execution exception for now and look at that case in a follow-up.
I think this would be a great first step.
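A minimal sketch of what retrying only on rejected executions could look like in the failure handler, assuming `backoff` is the iterator taken from the backoff policy and `scheduleRetry` is a hypothetical helper that re-submits the index request after the given delay; the merged change may be wired differently:

```java
@Override
public void onFailure(Exception e) {
    // Retry only when the node rejected the write because its queue was full
    // (TOO_MANY_REQUESTS); mapping failures and other errors won't fix
    // themselves within the retry window, so fail fast for those.
    if (e instanceof EsRejectedExecutionException && backoff.hasNext()) {
        TimeValue wait = backoff.next();
        logger.warn("couldn't store task result, retrying in [" + wait + "]", e);
        scheduleRetry(wait); // hypothetical: re-runs the index request after `wait`
    } else {
        listener.onFailure(e);
    }
}
```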
Have another look now!
The change looks good in general, but I think we might need to be a bit more conservative about when to retry.
LGTM. I am not sure a total retry time of 1 min is enough. It is consistent with what we do in bulk by default, but I feel like we could go with something longer here considering the somewhat internal nature of the request and that the backoff policy is not configurable here.
/**
 * The backoff policy to use when saving a task result fails. The total wait
 * time is 59460 milliseconds, about a minute.
 */
private static final BackoffPolicy STORE_BACKOFF_POLICY =
        BackoffPolicy.exponentialBackoff(timeValueMillis(500), 11);
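As a quick sanity check on the 59460 ms figure in the comment above, `BackoffPolicy` is an `Iterable<TimeValue>`, so the delays it hands out can simply be summed; this is only an illustrative snippet, not part of the change:

```java
// Sum the delays the policy produces; with a 500 ms initial delay and 11
// retries the total should be roughly a minute (59460 ms per the comment).
long totalMillis = 0;
for (TimeValue delay : BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(500), 11)) {
    totalMillis += delay.millis();
}
System.out.println("total backoff: " + totalMillis + "ms");
```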
Is there a reason why this is so short?
I just switched it to 10 minutes. That should be plenty of time!
Adds about a minute's worth of backoffs and retries to saving task
results so it is much more likely that a busy cluster won't lose task
results. This isn't an ideal solution to losing task results, but it is
an incremental improvement. If all of the retries fail we still log
the task result, but that is far from ideal.
Closes #33764