Failed to create snapshot [BUG] #240
Comments
Hi @tocw, do you happen to know if it's failing during the AttemptSnapshotStep or the WaitForSnapshotStep? Could you post the explain API response for the affected indices?
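For anyone following along, here is a minimal sketch of how the explain output could be collected and filtered for failed steps. It assumes the Open Distro ISM explain endpoint at `_opendistro/_ism/explain/<index>` and a response keyed by index name with `step.name` and `step.step_status` fields. The helper names (`explain_url`, `fetch_explain`, `failed_steps`) are hypothetical, and the exact response shape may differ between releases.

```python
import json
from urllib.request import urlopen


def explain_url(host: str, index_pattern: str) -> str:
    """Build the ISM explain URL for an index or index pattern."""
    return f"{host.rstrip('/')}/_opendistro/_ism/explain/{index_pattern}"


def fetch_explain(host: str, index_pattern: str) -> dict:
    """Call the explain API and return the parsed JSON body.

    Assumes an unauthenticated cluster; add auth headers as needed.
    """
    with urlopen(explain_url(host, index_pattern)) as resp:
        return json.load(resp)


def failed_steps(explain_response: dict) -> dict:
    """Map each index whose current step failed to that step's name.

    The field names ("step", "name", "step_status") are assumptions
    about the response layout, not a guaranteed contract.
    """
    failed = {}
    for index, meta in explain_response.items():
        if not isinstance(meta, dict):
            continue  # skip any non-index entries in the response
        step = meta.get("step") or {}
        if step.get("step_status") == "failed":
            failed[index] = step.get("name")
    return failed
```

Running `failed_steps(fetch_explain("http://localhost:9200", "my-index-*"))` would then show at a glance whether the failures are in the attempt step or the wait step (the host and index pattern here are placeholders).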
Hello @dbbaughe, here is the response from the explain API. Each failed index returns the same response:
Hi @tocw, it looks like it's the AttemptSnapshotStep. Your message I do see this log right below which should be logging out the exact error that it's from. Do you see anywhere in your elasticsearch.log an exception that has a line above it with
Hello @dbbaughe,
Hi @tocw, it looks like the log was cut off right where I wanted to see the error. There should be an exception right under this line: Although this log does let us see some other bugs, so that's good! There is no try/catch around the WaitForSnapshot step, so we need to add that. It also seems like the snapshotName is added into the info map from the AttemptSnapshot step here. In the WaitForSnapshot step, if it fails (and retries) or is still waiting for the snapshot to complete, it seems like it will overwrite the info map, e.g. here, and lose the snapshotName. In that case the name will be null in the next execution and you'll get the error you see for the WaitForSnapshotStep in that log. These should be relatively quick fixes for the WaitFor step: we just need to add try/catch and move the snapshotName from info to the ActionProperties in the ActionMetaData, using ForceMerge as an example. I still need to see the exception for the AttemptSnapshot step though.
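To make the overwrite problem concrete, here is a small sketch of it and of the proposed fix. It is a simplified Python model, not the plugin's actual code: the `Metadata` and `ActionProperties` classes, the function names, and the messages are all hypothetical stand-ins for the real step classes and ActionMetaData.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class ActionProperties:
    # Hypothetical stand-in for the per-action properties in the metadata.
    snapshot_name: Optional[str] = None


@dataclass(frozen=True)
class Metadata:
    info: dict = field(default_factory=dict)
    action_properties: ActionProperties = field(default_factory=ActionProperties)


def attempt_snapshot_buggy(meta: Metadata, name: str) -> Metadata:
    # The snapshot name lives only in the free-form info map.
    return replace(meta, info={"message": "snapshot started", "snapshotName": name})


def wait_for_snapshot_buggy(meta: Metadata, done: bool) -> Metadata:
    if meta.info.get("snapshotName") is None:
        return replace(meta, info={"message": "snapshot name is missing"})
    message = "snapshot completed" if done else "snapshot in progress"
    # Replacing the whole info map drops snapshotName for the next run.
    return replace(meta, info={"message": message})


def attempt_snapshot_fixed(meta: Metadata, name: str) -> Metadata:
    # The name is stored in dedicated action properties instead of info.
    return replace(
        meta,
        info={"message": "snapshot started"},
        action_properties=ActionProperties(snapshot_name=name),
    )


def wait_for_snapshot_fixed(meta: Metadata, done: bool) -> Metadata:
    if meta.action_properties.snapshot_name is None:
        return replace(meta, info={"message": "snapshot name is missing"})
    message = "snapshot completed" if done else "snapshot in progress"
    # Only info changes; action_properties is carried over untouched.
    return replace(meta, info={"message": message})
```

In the buggy variant, the first "still waiting" execution discards the name, so the second execution fails with a missing name. In the fixed variant the name survives any number of waits or retries because the status message and the name no longer share one map.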
Hello,
Hi @tocw, I was able to reproduce this. It's not caught in our tests because our integration tests run a single-node cluster and the jobs are run on the (only) master node. The error seems to occur in multi-node clusters where the job runs on a data node and does a transport request to the master node. The master node then throws an error like Thanks for reporting, will push out a fix shortly.
Hello,
We are running Open Distro release v1.8.0.
The problem I observe is that ISM fails on the "create snapshot" action for random indices.
Actually, the snapshot is created, but ISM fails to complete the action and does not delete the index in the next step.
If I retry the failed actions, only a couple of them at most are able to finish.
Here is the only log message connected to that problem:
Here is the policy:
Wasn't this supposed to be fixed by PR #172?