
Tasks: Retry if task can't be written #35054

Merged: 11 commits into elastic:master, Nov 30, 2018

Conversation

@nik9000 (Member) commented Oct 29, 2018

Adds about a minute's worth of backoffs and retries to saving task
results so it is much more likely that a busy cluster won't lose task
results. This isn't an ideal solution to losing task results, but it is
an incremental improvement. If all of the retries fail we still log
the task result, but that is far from ideal.

Closes #33764
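In outline, the retry described here consults a backoff iterator after each failed write and reschedules the same write while delays remain. A minimal sketch of that shape, assuming illustrative names (doStoreResult, threadPool, logger) rather than quoting TaskResultsService:

import java.util.Iterator;

import org.apache.logging.log4j.message.ParameterizedMessage;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.ThreadPool;

private void doStoreResult(Iterator<TimeValue> backoff, IndexRequestBuilder index, ActionListener<Void> listener) {
    index.execute(new ActionListener<IndexResponse>() {
        @Override
        public void onResponse(IndexResponse response) {
            listener.onResponse(null);
        }

        @Override
        public void onFailure(Exception e) {
            if (backoff.hasNext() == false) {
                // Out of retries: surface the failure so the caller can still log the result.
                listener.onFailure(e);
            } else {
                TimeValue wait = backoff.next();
                logger.warn(() -> new ParameterizedMessage("failed to store task result, retrying in [{}]", wait), e);
                // Re-run the same store attempt once the backoff delay has elapsed.
                threadPool.schedule(wait, ThreadPool.Names.SAME, () -> doStoreResult(backoff, index, listener));
            }
        }
    });
}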

@nik9000 added labels >enhancement, :Distributed Coordination/Task Management, v7.0.0, v6.6.0 on Oct 29, 2018
@nik9000 requested a review from @imotov on October 29, 2018 17:49
@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

@nik9000 (Member, Author) commented Oct 30, 2018

14:00:25 [ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/server/src/test/java/org/elasticsearch/action/admin/cluster/node/tasks/TaskStorageRetryIT.java:25:8: Unused import - org.elasticsearch.common.unit.TimeValue. [UnusedImports]

Oh, me.

@nik9000 (Member, Author) commented Nov 15, 2018

@imotov what do you think of this?

@@ -159,7 +178,13 @@ public void onResponse(IndexResponse indexResponse) {

        @Override
        public void onFailure(Exception e) {
            listener.onFailure(e);
            if (backoff.hasNext()) {
@imotov (Contributor) commented on this diff:

Should we be a bit more selective about when we retry here? Like if the failure is TOO_MANY_REQUESTS we should definitely retry, but if it is a mapping failure, it's unlikely somebody will fix it within a minute, so we definitely shouldn't. And, to be honest, I cannot think of any other failure besides EsRejectedExecutionException where it would definitely make sense to retry. If the index is temporarily unavailable and we are here, that means that we already retried, right? If settings or mappings are screwed up or the shard is freaking out for any other reason, we probably should just give up as well. What do you think?

@nik9000 (Member, Author) replied:

If the index is temporarily unavailable and we are here, that means that we already retried, right?

Like when we try to write to the index and the shard isn't allocated, say because both of the nodes that had a copy of it went down. That seems fairly unlikely, but possible. I wonder if I should just do rejected execution exception for now and look at that case in a follow up.

@imotov (Contributor) replied:

I wonder if I should just do rejected execution exception for now and look at that case in a follow up.

I think this would be a great first step.
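A sketch of that first step: the onFailure handler retries only when the write was rejected for load, and gives up immediately on anything else. Same assumed names as the sketch above; the exact merged code is not reproduced in this thread.

import org.elasticsearch.common.util.concurrent.EsRejectedExecutionException;

@Override
public void onFailure(Exception e) {
    if (e instanceof EsRejectedExecutionException == false || backoff.hasNext() == false) {
        // A non-retryable failure (e.g. a mapping problem) or no delays left:
        // give up and surface the failure to the caller.
        listener.onFailure(e);
    } else {
        TimeValue wait = backoff.next();
        logger.warn(() -> new ParameterizedMessage("failed to store task result, retrying in [{}]", wait), e);
        threadPool.schedule(wait, ThreadPool.Names.SAME, () -> doStoreResult(backoff, index, listener));
    }
}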

@nik9000 (Member, Author) replied:

Have another look now!

@imotov (Contributor) commented Nov 15, 2018

The change looks good in general but I think we might need to be a bit more conservative about when to retry.

@imotov (Contributor) left a review comment:

LGTM. I am not sure a total retry time of 1 min is enough. It is consistent with what we do in bulk by default, but I feel like we could go with something longer here, considering the somewhat internal nature of the request and that the backoff policy is not configurable here.

/**
 * The backoff policy to use when saving a task result fails. The total wait
 * time is 59460 milliseconds, about a minute.
 */
private static final BackoffPolicy STORE_BACKOFF_POLICY = BackoffPolicy.exponentialBackoff(timeValueMillis(500), 11);
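The 59460 figure is consistent with the delay formula in BackoffPolicy's exponential iterator, delay(n) = start + 10 * ((int) Math.exp(0.8 * n) - 1) for the n-th delay (0-based). A standalone check:

// Sums the 11 delays of exponentialBackoff(timeValueMillis(500), 11)
// using the delay formula from BackoffPolicy's exponential iterator.
public class BackoffSum {
    public static void main(String[] args) {
        long total = 0;
        for (int n = 0; n < 11; n++) {
            total += 500 + 10 * ((int) Math.exp(0.8 * n) - 1);
        }
        System.out.println(total + " ms"); // prints: 59460 ms
    }
}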
@imotov (Contributor) commented:

Is there a reason why this is so short?

@nik9000 (Member, Author) replied:

I just switched it to 10 minutes. That should be plenty of time!
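Under the same delay formula, a 250 ms start with 14 retries sums to exactly 600000 ms, i.e. ten minutes. The constants below illustrate a parameterization matching the new total; they are not quoted from the merged diff:

/**
 * The backoff policy to use when saving a task result fails. The total wait
 * time is 600000 milliseconds, about ten minutes.
 */
private static final BackoffPolicy STORE_BACKOFF_POLICY = BackoffPolicy.exponentialBackoff(timeValueMillis(250), 14);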

@nik9000 merged commit df56f07 into elastic:master on Nov 30, 2018
@nik9000 added a commit that referenced this pull request on Dec 3, 2018