
Failed to create snapshot [BUG] #240

Closed
tocw opened this issue Jun 16, 2020 · 7 comments
Labels: bug (Something isn't working)

Comments


tocw commented Jun 16, 2020

Hello,
We are running Open Distro release v1.8.0.
The problem I observe is that ISM fails on the "create snapshot" action for seemingly random indices.
The snapshot itself is actually created, but ISM fails to complete the action and does not delete the index in the next step.
If I retry the failed actions, only a couple of them at most are able to finish.
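
For reference, a failed ISM action can be retried per index with the retry API (POST _opendistro/_ism/retry/<failed-index-name>); the Kibana Index Management UI can be used as well.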

Here is the only log message connected to that problem:

[2020-06-15T15:00:44,404][INFO ][o.e.s.SnapshotsService   ] [************] snapshot [s3-repository-7-7:offline-retention-test.dev-2020.06.15-15:00:44.361/1GI64gWeRuKuBQ3jHcKFaA] started

[2020-06-15T15:00:44,404][WARN ][o.e.s.SnapshotsService   ] [************] [s3-repository-7-7][offline-retention-test.dev-2020.06.15-15:00:44.361] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [s3-repository-7-7:offline-retention-test.dev-2020.06.15-15:00:44.361]  a snapshot is already running
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:307) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]

Here is the policy:

{
    "policy_id": "offline-retention-test.dev",
    "description": "Policy for application: offline-retention-test.dev",
    "last_updated_time": 1592213668857,
    "schema_version": 1,
    "error_notification": null,
    "default_state": "Hot",
    "states": [
        {
            "name": "Hot",
            "actions": [
                {
                    "rollover": {
                        "min_size": "40gb",
                        "min_index_age": "1h"
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "Warm",
                    "conditions": {
                        "min_index_age": "2h"
                    }
                }
            ]
        },
        {
            "name": "Warm",
            "actions": [],
            "transitions": [
                {
                    "state_name": "Delete",
                    "conditions": {
                        "min_index_age": "3h"
                    }
                }
            ]
        },
        {
            "name": "Delete",
            "actions": [
                {
                    "snapshot": {
                        "repository": "s3-repository-7-7",
                        "snapshot": "offline-retention-test.dev"
                    }
                },
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ]
}

Wasn't this supposed to be fixed by PR #172?

tocw added the bug label on Jun 16, 2020
dbbaughe (Contributor) commented

Hi @tocw,

Do you happen to know if it's failing during the AttemptSnapshotStep or the WaitForSnapshotStep?
Looking through the code, it should be retrying on its own for any concurrent snapshot exceptions, so those showing up in the logs should be fine.

Could you post the explain API response for the affected indices?
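
(For reference, the explain output can be fetched per index or index pattern with the ISM explain API: GET _opendistro/_ism/explain/<index-name>.)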


tocw commented Jun 17, 2020

Hello @dbbaughe
I was not able to get that information from the log messages. If you know how to check, please send me some instructions.

Here is the response from the explain API; each failed index returns the same response:

{
  "f9549d2a-3d40dcb6acc1.prd-000081" : {
    "index.opendistro.index_state_management.policy_id" : "f9549d2a-3d40dcb6acc1.prd",
    "index" : "f9549d2a-3d40dcb6acc1.prd-000081",
    "index_uuid" : "NscPsHP4TfuhVZvaJby0cg",
    "policy_id" : "f9549d2a-3d40dcb6acc1.prd",
    "policy_seq_no" : 23739,
    "policy_primary_term" : 23,
    "rolled_over" : true,
    "state" : {
      "name" : "Delete",
      "start_time" : 1592387796488
    },
    "action" : {
      "name" : "snapshot",
      "start_time" : 1592387796488,
      "index" : 0,
      "failed" : true,
      "consumed_retries" : 0,
      "last_retry_time" : 0
    },
    "retry_info" : {
      "failed" : false,
      "consumed_retries" : 0
    },
    "info" : {
      "cause" : "[************][172.17.0.2:9300][cluster:admin/snapshot/create]",
      "message" : "Failed to create snapshot for index: f9549d2a-3d40dcb6acc1.prd-000081"
    }
  }
}

dbbaughe (Contributor) commented

Hi @tocw,

It looks like it's the AttemptSnapshotStep. Your message "Failed to create snapshot for index: f9549d2a-3d40dcb6acc1.prd-000081" is from here. It seems to be getting past the ConcurrentSnapshotExecutionException catch, which makes me think it's failing for a different reason and you're just seeing that error message for some of them.

I do see this log line right below it, which should log the exact exception it comes from. Do you see anywhere in your elasticsearch.log an exception that has a line above it with Failed to create snapshot for index: ...?


tocw commented Jun 18, 2020

Hello @dbbaughe,
thank you for the fast response. I think I found the log lines that may help you track down the cause of the problem:

[2020-06-17T07:12:06,802][INFO ][c.a.o.i.s.t.AttemptTransitionStep] [*****************] 5b5e6fd9-a333ed241903.prd-000187 transition conditions evaluated to true [indexCreationDate=1591168326309, numDocs=null, indexSize=null,stepStartTime=1591276926613]

...

[2020-06-17T07:17:06,620][INFO ][c.a.o.i.s.s.AttemptSnapshotStep] [*****************] Executing snapshot on 5b5e6fd9-a333ed241903.prd-000187
[2020-06-17T07:17:06,669][ERROR][c.a.o.i.s.s.AttemptSnapshotStep] [*****************] Failed to create snapshot for index: 5b5e6fd9-a333ed241903.prd-000187

...

[2020-06-17T07:19:10,316][INFO ][c.a.o.i.ManagedIndexRunner] [*****************] Previous execution failed to update step status, isIdempotent=true
[2020-06-17T07:19:10,426][INFO ][c.a.o.i.s.s.WaitForSnapshotStep] [*****************] Waiting for snapshot to complete...
[2020-06-17T07:19:10,482][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [*****************] uncaught exception in thread [DefaultDispatcher-worker-14]
org.elasticsearch.transport.RemoteTransportException: [******************][172.17.0.2:9300][cluster:admin/snapshot/status]
Caused by: org.elasticsearch.snapshots.SnapshotMissingException: [s3-repository:null] is missing
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.lambda$loadRepositoryData$3(TransportSnapshotsStatusAction.java:231) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.StepListener.whenComplete(StepListener.java:78) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.loadRepositoryData(TransportSnapshotsStatusAction.java:213) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.buildResponse(TransportSnapshotsStatusAction.java:202) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:105) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.masterOperation(TransportSnapshotsStatusAction.java:64) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:99) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:170) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]
[2020-06-17T07:19:18,248][INFO ][c.a.o.j.s.JobScheduler   ] [*****************] Descheduling jobId: h94rl6JgTWm5QFE_Y4ptHw

dbbaughe (Contributor) commented

Hi @tocw,

It looks like the log was cut off right where I wanted to see the error. There should be an exception right under this line:
[2020-06-17T07:17:06,669][ERROR][c.a.o.i.s.s.AttemptSnapshotStep] [*****************] Failed to create snapshot for index: 5b5e6fd9-a333ed241903.prd-000187

Although this log does let us see some other bugs, so that's good!

There is no try/catch around the WaitForSnapshot step, so we need to add that.

It also seems like the snapshotName is added into the info map by the AttemptSnapshot step here. Then in the WaitForSnapshot step, if it fails (and retries) or is still waiting for the snapshot to complete, it will overwrite the info map (e.g. here) and lose the snapshotName. In that case, on the next execution the name will be null and you'll get the error you see for the WaitForSnapshotStep in that log.

These should be relatively quick fixes for the WaitFor step: we just need to add a try/catch and move the snapshotName from info to the ActionProperties in the ActionMetaData, using ForceMerge as an example.
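
A rough sketch of the idea, with hypothetical class shapes rather than the plugin's actual code, just to show where the snapshot name would live:

// Hypothetical sketch: names and shapes are assumed, not the plugin's real classes.
// The snapshot name is kept in the action's properties, which are carried across
// executions, instead of the info map, which WaitForSnapshotStep overwrites on
// every evaluation/retry.
data class ActionProperties(val snapshotName: String? = null)

data class ActionMetaData(
    val name: String,
    val failed: Boolean,
    val info: Map<String, Any> = emptyMap(),        // free-form, safe to overwrite
    val actionProperties: ActionProperties? = null  // must be carried forward
)

fun afterAttemptSnapshot(current: ActionMetaData, snapshotName: String): ActionMetaData =
    current.copy(
        info = mapOf("message" to "Successfully started snapshot"),
        actionProperties = ActionProperties(snapshotName)
    )

fun afterWaitForSnapshotRetry(current: ActionMetaData): ActionMetaData =
    // Overwrites info but keeps actionProperties, so the snapshot name is not
    // lost and the next status check does not look up a null snapshot.
    current.copy(info = mapOf("message" to "Waiting for snapshot to complete"))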

Still need to see the exception for the AttemptSnapshot step though.


tocw commented Jun 19, 2020

Hello,
sorry for the delay. Here is the missing part:

[2020-06-19T08:44:37,865][INFO ][c.a.o.i.s.s.AttemptSnapshotStep] [els-hot-3.**************] Executing snapshot on 15a42509-a5d4-61ed7373d7e4.dev-000160
[2020-06-19T08:44:37,899][ERROR][c.a.o.i.s.s.AttemptSnapshotStep] [els-hot-3.**************] Failed to create snapshot for index: 15a42509-a5d4-61ed7373d7e4.dev-000160
org.elasticsearch.transport.RemoteTransportException: [els-master-1.**************][172.17.0.2:9300][cluster:admin/snapshot/create]
Caused by: org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [s3-repository:15a42509-a5d4-61ed7373d7e4.dev-2020.06.19-08:44:37.866] cannot snapshot while a snapshot deletion is in-progress in [SnapshotDeletionsInProgress[a5b8ee3b-9dbc-1f96bcbd6f69.uat-2020.05.11-08:13:01.910]]
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:290) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:835) ~[?:?]
[2020-06-19T08:44:51,897][INFO ][c.a.o.j.s.JobSweeper     ] [els-hot-3.**************] Descheduling job Nt2-tlIpSM-Hi_wghhHnFw on index .opendistro-ism-config
[2020-06-19T08:44:51,897][INFO ][c.a.o.j.s.JobScheduler   ] [els-hot-3.**************] Descheduling jobId: Nt2-tlIpSM-Hi_wghhHnFw

dbbaughe (Contributor) commented

Hi @tocw,

I was able to reproduce this. It's not caught by our tests because our integration tests run a single-node cluster, so the jobs run on the (only) master node. The problem shows up in multi-node clusters, where the job runs on a data node and makes a transport request to the master node. The master node then throws an error like ConcurrentSnapshotExecutionException and wraps it in a RemoteTransportException to return to the calling node, and that wrapped exception doesn't get caught by our catch (e: ConcurrentSnapshotExecutionException) { ... }.
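
One way to handle that, sketched below with Elasticsearch's ExceptionsHelper utility (not necessarily the exact fix that will ship), is to unwrap the remote exception before matching on its type:

import org.elasticsearch.ExceptionsHelper
import org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException
import org.elasticsearch.transport.RemoteTransportException

// On a multi-node cluster the data node sees a RemoteTransportException whose
// cause is the ConcurrentSnapshotExecutionException thrown on the master node,
// so matching on the outer exception type alone misses it.
fun isConcurrentSnapshotFailure(e: Exception): Boolean {
    val cause = if (e is RemoteTransportException) ExceptionsHelper.unwrapCause(e) else e
    return cause is ConcurrentSnapshotExecutionException
}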

Thanks for reporting, will push out a fix shortly.
