
Tasks: Retry if task can't be written #35054

Merged: 11 commits into elastic:master, Nov 30, 2018

Conversation

@nik9000 (Member) commented Oct 29, 2018

Adds about a minute's worth of backoffs and retries to saving task
results so it is much more likely that a busy cluster won't lose task
results. This isn't an ideal solution to losing task results, but it is
an incremental improvement. If all of the retries fail we still log
the task result, but that is far from ideal.

Closes #33764
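In outline, the retry described here consults a backoff iterator after each failed write and reschedules the same write while delays remain. A minimal sketch of that shape, assuming illustrative names (doStoreResult, threadPool, logger) rather than quoting TaskResultsService:

import java.util.Iterator;

import org.apache.logging.log4j.message.ParameterizedMessage;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.ThreadPool;

private void doStoreResult(Iterator<TimeValue> backoff, IndexRequestBuilder index, ActionListener<Void> listener) {
    index.execute(new ActionListener<IndexResponse>() {
        @Override
        public void onResponse(IndexResponse response) {
            listener.onResponse(null);
        }

        @Override
        public void onFailure(Exception e) {
            if (backoff.hasNext() == false) {
                // Out of retries: surface the failure so the caller can still log the result.
                listener.onFailure(e);
            } else {
                TimeValue wait = backoff.next();
                logger.warn(() -> new ParameterizedMessage("failed to store task result, retrying in [{}]", wait), e);
                // Re-run the same store attempt once the backoff delay has elapsed.
                threadPool.schedule(wait, ThreadPool.Names.SAME, () -> doStoreResult(backoff, index, listener));
            }
        }
    });
}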

@nik9000 added labels >enhancement, :Distributed Coordination/Task Management, v7.0.0, v6.6.0 on Oct 29, 2018
@nik9000 requested a review from @imotov on October 29, 2018 17:49
@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

@nik9000 (Member, Author) commented Oct 30, 2018

14:00:25 [ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/server/src/test/java/org/elasticsearch/action/admin/cluster/node/tasks/TaskStorageRetryIT.java:25:8: Unused import - org.elasticsearch.common.unit.TimeValue. [UnusedImports]

Oh, me.

@nik9000 (Member, Author) commented Nov 15, 2018

@imotov what do you think of this?

@@ -159,7 +178,13 @@ public void onResponse(IndexResponse indexResponse) {

        @Override
        public void onFailure(Exception e) {
            listener.onFailure(e);
            if (backoff.hasNext()) {
@imotov (Contributor) commented on this diff:

Should we be a bit more selective about when we retry here? Like if the failure is TOO_MANY_REQUESTS we should definitely retry, but if it is a mapping failure, it's unlikely somebody will fix it within a minute, so we definitely shouldn't. And, to be honest, I cannot think of any other failure besides EsRejectedExecutionException where it would definitely make sense to retry. If the index is temporarily unavailable and we are here, that means that we already retried, right? If settings or mappings are screwed up or the shard is freaking out for any other reason, we probably should just give up as well. What do you think?

@nik9000 (Member, Author) replied:

If the index is temporarily unavailable and we are here, that means that we already retried, right?

Like when we try to write to the index and the shard isn't allocated, say because both of the nodes that had a copy of it went down. That seems fairly unlikely, but possible. I wonder if I should just do rejected execution exception for now and look at that case in a follow up.

@imotov (Contributor) replied:

I wonder if I should just do rejected execution exception for now and look at that case in a follow up.

I think this would be a great first step.
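A sketch of that first step: the onFailure handler retries only when the write was rejected for load, and gives up immediately on anything else. Same assumed names as the sketch above; the exact merged code is not reproduced in this thread.

import org.elasticsearch.common.util.concurrent.EsRejectedExecutionException;

@Override
public void onFailure(Exception e) {
    if (e instanceof EsRejectedExecutionException == false || backoff.hasNext() == false) {
        // A non-retryable failure (e.g. a mapping problem) or no delays left:
        // give up and surface the failure to the caller.
        listener.onFailure(e);
    } else {
        TimeValue wait = backoff.next();
        logger.warn(() -> new ParameterizedMessage("failed to store task result, retrying in [{}]", wait), e);
        threadPool.schedule(wait, ThreadPool.Names.SAME, () -> doStoreResult(backoff, index, listener));
    }
}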

@nik9000 (Member, Author) replied:

Have another look now!

@imotov (Contributor) commented Nov 15, 2018

The change looks good in general but I think we might need to be a bit more conservative about when to retry.

@imotov (Contributor) left a review comment:

LGTM. I am not sure a total retry time of 1 min is enough. It is consistent with what we do in bulk by default, but I feel like we could go with something longer here, considering the somewhat internal nature of the request and that the backoff policy is not configurable here.

/**
 * The backoff policy to use when saving a task result fails. The total wait
 * time is 59460 milliseconds, about a minute.
 */
private static final BackoffPolicy STORE_BACKOFF_POLICY = BackoffPolicy.exponentialBackoff(timeValueMillis(500), 11);
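The 59460 figure is consistent with the delay formula in BackoffPolicy's exponential iterator, delay(n) = start + 10 * ((int) Math.exp(0.8 * n) - 1) for the n-th delay (0-based). A standalone check:

// Sums the 11 delays of exponentialBackoff(timeValueMillis(500), 11)
// using the delay formula from BackoffPolicy's exponential iterator.
public class BackoffSum {
    public static void main(String[] args) {
        long total = 0;
        for (int n = 0; n < 11; n++) {
            total += 500 + 10 * ((int) Math.exp(0.8 * n) - 1);
        }
        System.out.println(total + " ms"); // prints: 59460 ms
    }
}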
@imotov (Contributor) commented:

Is there a reason why this is so short?

@nik9000 (Member, Author) replied:

I just switched it to 10 minutes. That should be plenty of time!
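Under the same delay formula, a 250 ms start with 14 retries sums to exactly 600000 ms, i.e. ten minutes. The constants below illustrate a parameterization matching the new total; they are not quoted from the merged diff:

/**
 * The backoff policy to use when saving a task result fails. The total wait
 * time is 600000 milliseconds, about ten minutes.
 */
private static final BackoffPolicy STORE_BACKOFF_POLICY = BackoffPolicy.exponentialBackoff(timeValueMillis(250), 14);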

@nik9000 merged commit df56f07 into elastic:master on Nov 30, 2018
@nik9000 added a commit that referenced this pull request on Dec 3, 2018