
Failed to create snapshot [BUG] #240

Closed
tocw opened this issue Jun 16, 2020 · 7 comments
Labels: bug (Something isn't working)

Comments


tocw commented Jun 16, 2020

Hello,
We are running Open Distro release v1.8.0.
The problem I observe is that ISM fails on the "create snapshot" action for seemingly random indices.
The snapshot itself is actually created, but ISM fails to complete the action and does not delete the index in the next step.
If I retry the failed actions, only a couple of them at most are able to finish.
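
For reference, a failed ISM action can be retried per index with the retry API (POST _opendistro/_ism/retry/<failed-index-name>); the Kibana Index Management UI can be used as well.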

Here is the only log message connected to that problem:

[2020-06-15T15:00:44,404][INFO ][o.e.s.SnapshotsService   ] [************] snapshot [s3-repository-7-7:offline-retention-test.dev-2020.06.15-15:00:44.361/1GI64gWeRuKuBQ3jHcKFaA] started

[2020-06-15T15:00:44,404][WARN ][o.e.s.SnapshotsService   ] [************] [s3-repository-7-7][offline-retention-test.dev-2020.06.15-15:00:44.361] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [s3-repository-7-7:offline-retention-test.dev-2020.06.15-15:00:44.361]  a snapshot is already running
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:307) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]

Here is the policy:

{
    "policy_id": "offline-retention-test.dev",
    "description": "Policy for application: offline-retention-test.dev",
    "last_updated_time": 1592213668857,
    "schema_version": 1,
    "error_notification": null,
    "default_state": "Hot",
    "states": [
        {
            "name": "Hot",
            "actions": [
                {
                    "rollover": {
                        "min_size": "40gb",
                        "min_index_age": "1h"
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "Warm",
                    "conditions": {
                        "min_index_age": "2h"
                    }
                }
            ]
        },
        {
            "name": "Warm",
            "actions": [],
            "transitions": [
                {
                    "state_name": "Delete",
                    "conditions": {
                        "min_index_age": "3h"
                    }
                }
            ]
        },
        {
            "name": "Delete",
            "actions": [
                {
                    "snapshot": {
                        "repository": "s3-repository-7-7",
                        "snapshot": "offline-retention-test.dev"
                    }
                },
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ]
}

Wasn't this supposed to be fixed by PR #172?

tocw added the bug label on Jun 16, 2020
dbbaughe (Contributor) commented

Hi @tocw,

Do you happen to know if it's failing during the AttemptSnapshotStep or the WaitForSnapshotStep?
Looking through the code, it should be retrying on its own for any concurrent snapshot exceptions, so those showing up in the logs should be fine.

Could you post the explain API response for the affected indices?
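
(For reference, the explain output can be fetched per index or index pattern with the ISM explain API: GET _opendistro/_ism/explain/<index-name>.)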


tocw commented Jun 17, 2020

Hello @dbbaughe
I was not able to get that information from the log messages. If you know how to check, please send me some instructions.

Here is the response from the explain API; each failed index returns the same response:

{
  "f9549d2a-3d40dcb6acc1.prd-000081" : {
    "index.opendistro.index_state_management.policy_id" : "f9549d2a-3d40dcb6acc1.prd",
    "index" : "f9549d2a-3d40dcb6acc1.prd-000081",
    "index_uuid" : "NscPsHP4TfuhVZvaJby0cg",
    "policy_id" : "f9549d2a-3d40dcb6acc1.prd",
    "policy_seq_no" : 23739,
    "policy_primary_term" : 23,
    "rolled_over" : true,
    "state" : {
      "name" : "Delete",
      "start_time" : 1592387796488
    },
    "action" : {
      "name" : "snapshot",
      "start_time" : 1592387796488,
      "index" : 0,
      "failed" : true,
      "consumed_retries" : 0,
      "last_retry_time" : 0
    },
    "retry_info" : {
      "failed" : false,
      "consumed_retries" : 0
    },
    "info" : {
      "cause" : "[************][172.17.0.2:9300][cluster:admin/snapshot/create]",
      "message" : "Failed to create snapshot for index: f9549d2a-3d40dcb6acc1.prd-000081"
    }
  }
}

dbbaughe (Contributor) commented

Hi @tocw,

It looks like it's the AttemptSnapshotStep. Your message "Failed to create snapshot for index: f9549d2a-3d40dcb6acc1.prd-000081" is from here. It seems to be getting past the ConcurrentSnapshotExecutionException catch, which makes me think it's failing for a different reason and you're just seeing that error message for some of them.

I do see this log line right below it, which should log the exact exception it comes from. Do you see anywhere in your elasticsearch.log an exception that has a line above it with Failed to create snapshot for index: ...?


tocw commented Jun 18, 2020

Hello @dbbaughe,
thank you for the fast response. I think I found the log lines that may help you track down the cause of the problem:

[2020-06-17T07:12:06,802][INFO ][c.a.o.i.s.t.AttemptTransitionStep] [*****************] 5b5e6fd9-a333ed241903.prd-000187 transition conditions evaluated to true [indexCreationDate=1591168326309, numDocs=null, indexSize=null,stepStartTime=1591276926613]

...

[2020-06-17T07:17:06,620][INFO ][c.a.o.i.s.s.AttemptSnapshotStep] [*****************] Executing snapshot on 5b5e6fd9-a333ed241903.prd-000187
[2020-06-17T07:17:06,669][ERROR][c.a.o.i.s.s.AttemptSnapshotStep] [*****************] Failed to create snapshot for index: 5b5e6fd9-a333ed241903.prd-000187

...

[2020-06-17T07:19:10,316][INFO ][c.a.o.i.ManagedIndexRunner] [*****************] Previous execution failed to update step status, isIdempotent=true
[2020-06-17T07:19:10,426][INFO ][c.a.o.i.s.s.WaitForSnapshotStep] [*****************] Waiting for snapshot to complete...
[2020-06-17T07:19:10,482][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [*****************] uncaught exception in thread [DefaultDispatcher-worker-14]
org.elasticsearch.transport.RemoteTransportException: [******************][172.17.0.2:9300][cluster:admin/snapshot/status]
Caused by: org.elasticsearch.snapshots.SnapshotMissingException: [s3-repository:null] is missing
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.lambda$loadRepositoryData$3(TransportSnapshotsStatusAction.java:231) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.StepListener.whenComplete(StepListener.java:78) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.loadRepositoryData(TransportSnapshotsStatusAction.java:213) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.buildResponse(TransportSnapshotsStatusAction.java:202) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:105) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:64) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:99) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:170) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]
[2020-06-17T07:19:18,248][INFO ][c.a.o.j.s.JobScheduler   ] [*****************] Descheduling jobId: h94rl6JgTWm5QFE_Y4ptHw

dbbaughe (Contributor) commented

Hi @tocw,

It looks like the log was cut off right where I wanted to see the error. There should be an exception right under this line:
[2020-06-17T07:17:06,669][ERROR][c.a.o.i.s.s.AttemptSnapshotStep] [*****************] Failed to create snapshot for index: 5b5e6fd9-a333ed241903.prd-000187

Although this log does let us see some other bugs, so that's good!

There is no try/catch around the WaitForSnapshot step, so we need to add that.

It also seems like the snapshotName is added into the info map by the AttemptSnapshot step here. Then in the WaitForSnapshot step, if it fails (and retries) or is still waiting for the snapshot to complete, it will overwrite the info map (e.g. here) and lose the snapshotName. In that case, on the next execution the name will be null and you'll get the error you see for the WaitForSnapshotStep in that log.

These should be relatively quick fixes for the WaitFor step: we just need to add a try/catch and move the snapshotName from info to the ActionProperties in the ActionMetaData, using ForceMerge as an example.
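
A rough sketch of the idea, with hypothetical class shapes rather than the plugin's actual code, just to show where the snapshot name would live:

// Hypothetical sketch: names and shapes are assumed, not the plugin's real classes.
// The snapshot name is kept in the action's properties, which are carried across
// executions, instead of the info map, which WaitForSnapshotStep overwrites on
// every evaluation/retry.
data class ActionProperties(val snapshotName: String? = null)

data class ActionMetaData(
    val name: String,
    val failed: Boolean,
    val info: Map<String, Any> = emptyMap(),        // free-form, safe to overwrite
    val actionProperties: ActionProperties? = null  // must be carried forward
)

fun afterAttemptSnapshot(current: ActionMetaData, snapshotName: String): ActionMetaData =
    current.copy(
        info = mapOf("message" to "Successfully started snapshot"),
        actionProperties = ActionProperties(snapshotName)
    )

fun afterWaitForSnapshotRetry(current: ActionMetaData): ActionMetaData =
    // Overwrites info but keeps actionProperties, so the snapshot name is not
    // lost and the next status check does not look up a null snapshot.
    current.copy(info = mapOf("message" to "Waiting for snapshot to complete"))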

Still need to see the exception for the AttemptSnapshot step though.


tocw commented Jun 19, 2020

Hello,
sorry for the delay. Here is the missing part:

[2020-06-19T08:44:37,865][INFO ][c.a.o.i.s.s.AttemptSnapshotStep] [els-hot-3.**************] Executing snapshot on 15a42509-a5d4-61ed7373d7e4.dev-000160
[2020-06-19T08:44:37,899][ERROR][c.a.o.i.s.s.AttemptSnapshotStep] [els-hot-3.**************] Failed to create snapshot for index: 15a42509-a5d4-61ed7373d7e4.dev-000160
org.elasticsearch.transport.RemoteTransportException: [els-master-1.**************][172.17.0.2:9300][cluster:admin/snapshot/create]
Caused by: org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [s3-repository:15a42509-a5d4-61ed7373d7e4.dev-2020.06.19-08:44:37.866] cannot snapshot while a snapshot deletion is in-progress in [SnapshotDeletionsInProgress[a5b8ee3b-9dbc-1f96bcbd6f69.uat-2020.05.11-08:13:01.910]]
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:290) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]
[2020-06-19T08:44:51,897][INFO ][c.a.o.j.s.JobSweeper     ] [els-hot-3.**************] Descheduling job Nt2-tlIpSM-Hi_wghhHnFw on index .opendistro-ism-config
[2020-06-19T08:44:51,897][INFO ][c.a.o.j.s.JobScheduler   ] [els-hot-3.**************] Descheduling jobId: Nt2-tlIpSM-Hi_wghhHnFw

dbbaughe (Contributor) commented

Hi @tocw,

I was able to reproduce this. It's not caught by our tests because our integration tests run a single-node cluster, so the jobs run on the (only) master node. The problem shows up in multi-node clusters, where the job runs on a data node and makes a transport request to the master node. The master node then throws an error like ConcurrentSnapshotExecutionException and wraps it in a RemoteTransportException to return to the calling node, and that wrapped exception doesn't get caught by our catch (e: ConcurrentSnapshotExecutionException) { ... }.
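
One way to handle that, sketched below with Elasticsearch's ExceptionsHelper utility (not necessarily the exact fix that will ship), is to unwrap the remote exception before matching on its type:

import org.elasticsearch.ExceptionsHelper
import org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException
import org.elasticsearch.transport.RemoteTransportException

// On a multi-node cluster the data node sees a RemoteTransportException whose
// cause is the ConcurrentSnapshotExecutionException thrown on the master node,
// so matching on the outer exception type alone misses it.
fun isConcurrentSnapshotFailure(e: Exception): Boolean {
    val cause = if (e is RemoteTransportException) ExceptionsHelper.unwrapCause(e) else e
    return cause is ConcurrentSnapshotExecutionException
}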

Thanks for reporting, will push out a fix shortly.
