Fix snapshot getting stuck in INIT state #27214

imotov · 2017-11-01T20:31:12Z

If the master disconnects from the cluster after initiating snapshot, but just before the snapshot switches from INIT to STARTED state, the snapshot can get indefinitely stuck in the INIT state. This error is specific to v5.x+ and was triggered by keeping the master node that stepped down in the node list, the cleanup logic in snapshot/restore assumed that if master steps down it is always removed from the the node list. This commit changes the logic to trigger cleanup even if no nodes left the cluster.

Closes #27180

If the master disconnects from the cluster after initiating snapshot, but just before the snapshot switches from INIT to STARTED state, the snapshot can get indefinitely stuck in the INIT state. This error is specific to v5.x+ and was triggered by keeping the master node that stepped down in the node list, the cleanup logic in snapshot/restore assumed that if master steps down it is always removed from the the node list. This commit changes the logic to trigger cleanup even if no nodes left the cluster. Closes elastic#27180

ywelsch

Overall change LGTM. I've left one question and two minor comments.

ywelsch · 2017-11-02T12:11:40Z

core/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

@@ -1014,6 +1028,11 @@ public void onFailure(String source, Exception e) {
            }

            @Override
+            public void onNoLongerMaster(String source) {
+                listener.onNoLongerMaster(source);


can listener be null here? It is marked as Nullable above?

Yes! Good catch! Fixing it.

ywelsch · 2017-11-02T12:24:13Z

core/src/test/java/org/elasticsearch/discovery/SnapshotDisruptionIT.java

+            client().admin().cluster().prepareCreateSnapshot("test-repo", "test-snap-2").setWaitForCompletion(false)
+                .setMasterNodeTimeout("5s").setIndices(idxName).get();
+        } catch (Exception ex) {
+            logger.info("--> got exception from hanged master", ex);


should this be a timeout exception. can you assert on that?

Good point.

ywelsch · 2017-11-02T12:28:13Z

core/src/test/java/org/elasticsearch/discovery/SnapshotDisruptionIT.java

+                .put("compress", randomBoolean())
+                .put("chunk_size", randomIntBetween(100, 1000), ByteSizeUnit.BYTES)));
+
+        // Writing incompatible snapshot can cause this test to fail due to a race condition in repo initialization


I don't understand this comment. What's the issue with repo initialization here? When the disruption triggers, then there is no more repo writing done by the old master node AFAICS?

There is a race condition in writing a list of incompatible snapshot in getRepositoryData method that occurs on empty repositories. This method is called in the START phase on the former master and during clean up on the new master if this file doesn't exist in the repo, which happens in the repo. It shouldn't cause any issues in the real life, but it makes test to fail occasionally due to asserts. We will definitely need to address it at some point of time, but I don't think we should do it as part of this PR.

ywelsch

LGTM. Let's have this run for a bit on CI on master before backporting.

* es/master: Fix snapshot getting stuck in INIT state (#27214) Add an example of dynamic field names (#27255) #26260 Allow ip_range to accept CIDR notation (#27192) #27189 Fixed rounding of bounds in scaled float comparison (#27207) Add support for Gradle 4.3 (#27249) Fixes QueryStringQueryBuilderTests build: Fix setting the incorrect bwc version in mixed cluster qa module [Test] Fix QueryStringQueryBuilderTests.testExistsFieldQuery BWC Adjust assertions for sequence numbers BWC tests Do not create directories if repository is readonly (#26909) [Test] Fix InternalStatsTests [Test] Fix QueryStringQueryBuilderTests.testExistsFieldQuery Uses norms for exists query if enabled (#27237) Reinstate recommendation for ≥ 3 master-eligible nodes. (#27204)

If the master disconnects from the cluster after initiating snapshot, but just before the snapshot switches from INIT to STARTED state, the snapshot can get indefinitely stuck in the INIT state. This error is specific to v5.x+ and was triggered by keeping the master node that stepped down in the node list, the cleanup logic in snapshot/restore assumed that if master steps down it is always removed from the the node list. This commit changes the logic to trigger cleanup even if no nodes left the cluster. Closes #27180

* 6.x: Update Tika version to 1.15 Introduce templating support to timezone/locale in DateProcessor (#27089) Increase logging on qa:mixed-cluster tests Update to AWS SDK 1.11.223 (#27278) Improve error message for parse failures of completion fields (#27297) Remove optimisations to reuse objects when applying a new `ClusterState` (#27317) Decouple `ChannelFactory` from Tcp classes (#27286) Use PlainListenableActionFuture for CloseFuture (#26242) Fix find remote when building BWC Remove colons from task and configuration names Fix snapshot getting stuck in INIT state (#27214) Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (#26844) Add unreleased 5.6.5 version number testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards` Correct comment in index shard test Roll translog generation on primary promotion ObjectParser: Replace IllegalStateException with ParsingException (#27302) scripted_metric _agg parameter disappears if params are provided (#27159) Update discovery-ec2.asciidoc

…snapshot With the current snapshot/restore logic, a newly created snapshot is added by the SnapshotService.createSnapshot() method as a SnapshotInProgress object in the cluster state. This snapshot has the INIT state. Once the cluster state update is processed, the beginSnapshot() method is executed using the SNAPSHOT thread pool. The beginSnapshot() method starts the initialization of the snapshot using the initializeSnapshot() method. This method reads the repository data and then writes the global metadata file and an index metadata file per index to be snapshotted. These operations can take some time to be completed (many minutes). At this stage and if the master node is disconnected the snapshot can be stucked in INIT state on versions 5.6.4/6.0.0 or lower (pull request elastic#27214 fixed this on 5.6.5/6.0.1 and higher). If the snapshot is not stucked but the initialization takes some time and the user decides to abort the snapshot, a delete snapshot request can sneak in. The deletion updates the cluster state to check the state of the SnapshotInProgress. When the snapshot is in INIT, it executes the endSnapshot() method (which returns immediately) and then the snapshot's state is updated to ABORTED in the cluster state. The deletion will then listen for the snapshot completion in order to continue with the deletion of the snapshot. But before returning, the endSnapshot() method added a new Runnable to the SNAPSHOT thread pool that forces the finalization of the initializing snapshot. This finalization writes the snapshot metadata file and updates the index-N file in the repository. At this stage two things can potentially be executed concurrently: the initialization of the snapshot and the finalization of the snapshot. When the initializeSnapshot() is terminated, the cluster state is updated to start the snapshot and to move it to the STARTED state (this is before elastic#27931 which prevents an ABORTED snapshot to be started at all). The snapshot is started and shards start to be snapshotted but they quickly fail because the snapshot was ABORTED by the deletion. All shards are reported as FAILED to the master node, which executes endSnapshot() too (using SnapshotStateExecutor). Then many things can happen, depending on the execution of tasks by the SNAPSHOT thread pool and the time taken by each read/write/delete operation by the repository implementation. Especially on S3, where operations can take time (disconnections, retries, timeouts) and where the data consistency model allows to read old data or requires some time for objects to be replicated. Here are some scenario seen in cluster logs: a) the snapshot is finalized by the snapshot deletion. Snapshot metadata file exists in the repository so the future finalization by the snapshot creation will fail with a "fail to finalize snapshot" message in logs. Deletion process continues. b) the snapshot is finalized by the snapshot creation. Snapshot metadata file exists in the repository so the future finalization by the snapshot deletion will fail with a "fail to finalize snapshot" message in logs. Deletion process continues. c) both finalizations are executed concurrently, things can fail at different read or write operations. Shards failures can be lost as well as final snapshot state, depending on which SnapshotInProgress.Entry is used to finalize the snapshot. d) the snapshot is finalized by the snapshot deletion, the snapshot in progress is removed from the cluster state, triggering the execution of the completion listeners. The deletion process continues and the deleteSnapshotFromRepository() is executed using the SNAPSHOT thread pool. This method reads the repository data, the snapshot metadata and the index metadata for all indices included in the snapshot before updated the index-N file from the repository. It can also take some time and I think these operations could potentially be executed concurrently with the finalization of the snapshot by the snapshot creation, leading to corrupted data. This commit does not solve all the issues reported here, but it removes the finalization of the snapshot by the snapshot deletion. This way, the deletion marks the snapshot as ABORTED in cluster state and waits for the snapshot completion. It is the responsability of the snapshot execution to detect the abortion and terminates itself correctly. This avoids concurrent snapshot finalizations and also ordinates the operations: the deletion aborts the snapshot and waits for the snapshot completion, the creation detects the abortion and stops by itself and finalizes the snapshot, then the deletion resumes and continues the deletion process.

imotov added 2 commits November 1, 2017 11:00

Improve handling of clean up on the disconnected master node

c022d09

imotov added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >bug v5.6.4 v6.0.1 v6.1.0 v7.0.0 labels Nov 1, 2017

imotov requested a review from ywelsch November 1, 2017 20:31

ywelsch suggested changes Nov 2, 2017

View reviewed changes

Address @ywelsch's comments

b7eb257

ywelsch approved these changes Nov 3, 2017

View reviewed changes

imotov merged commit 117f0f3 into elastic:master Nov 3, 2017

imotov added the backport pending label Nov 3, 2017

imotov added v5.6.5 and removed backport pending v5.6.4 labels Nov 8, 2017

tlrx mentioned this pull request Jan 4, 2018

Avoid concurrent snapshot finalizations when deleting an INIT snapshot #28078

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

original-brownbear mentioned this pull request Mar 5, 2019

Wipe Snapshots Before Indices in RestTests #39662

Merged

imotov deleted the issue-27180-fix-snapshot-stuck-in-init-state branch May 1, 2020 22:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix snapshot getting stuck in INIT state #27214

Fix snapshot getting stuck in INIT state #27214

imotov commented Nov 1, 2017

ywelsch left a comment

ywelsch Nov 2, 2017

imotov Nov 2, 2017

ywelsch Nov 2, 2017

imotov Nov 2, 2017

ywelsch Nov 2, 2017

imotov Nov 2, 2017

ywelsch left a comment

Fix snapshot getting stuck in INIT state #27214

Fix snapshot getting stuck in INIT state #27214

Conversation

imotov commented Nov 1, 2017

ywelsch left a comment

Choose a reason for hiding this comment

ywelsch Nov 2, 2017

Choose a reason for hiding this comment

imotov Nov 2, 2017

Choose a reason for hiding this comment

ywelsch Nov 2, 2017

Choose a reason for hiding this comment

imotov Nov 2, 2017

Choose a reason for hiding this comment

ywelsch Nov 2, 2017

Choose a reason for hiding this comment

imotov Nov 2, 2017

Choose a reason for hiding this comment

ywelsch left a comment

Choose a reason for hiding this comment