Abort snapshots on a node that leaves the cluster #21084
Conversation
Previously, if a node left the cluster during a snapshot (for example, due to a long GC), the master node would mark the snapshot as failed, but the node itself could continue snapshotting the data on its shards to the repository. If the node rejoins the cluster, the master may assign it to hold the replica shard (where it held the primary before getting kicked off the cluster). The initialization of the replica shard would repeatedly fail with a ShardLockObtainFailedException until the snapshot thread finally finishes and relinquishes the lock on the Store. This commit resolves the situation by ensuring that the shard snapshot is aborted when the node responsible for that shard's snapshot leaves the cluster. When the node rejoins the cluster, it will see in the cluster state that the snapshot for that shard has failed and abort the snapshot locally, allowing the shard data directory to be freed for allocation of a replica shard on the same node. Closes elastic#20876
I wonder if the better change would be to treat aborting snapshots in the same way as we abort outgoing peer recoveries of a primary: by registering an IndexEventListener that listens for beforeIndexShardClosed and cancels recoveries / aborts ongoing snapshots at that time. This ensures that snapshots are aborted whenever we close the shard, simplifying the logic here. WDYT?
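For illustration, a minimal standalone sketch of that pattern, with stand-in names rather than the actual Elasticsearch types: a per-shard registry of in-flight snapshot statuses whose entries are aborted from a shard-closed callback, mirroring how outgoing peer recoveries are cancelled.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in names only; not the actual Elasticsearch types.
final class ShardSnapshotRegistry {

    enum Stage { STARTED, ABORTED }

    // One mutable status per in-flight shard snapshot; the snapshot
    // thread polls it and bails out once it flips to ABORTED.
    static final class SnapshotStatus {
        private volatile Stage stage = Stage.STARTED;
        void abort() { stage = Stage.ABORTED; }
        boolean isAborted() { return stage == Stage.ABORTED; }
    }

    private final Map<String, SnapshotStatus> inFlight = new ConcurrentHashMap<>();

    // Registered when a shard snapshot begins.
    SnapshotStatus onSnapshotStarted(String shardId) {
        return inFlight.computeIfAbsent(shardId, id -> new SnapshotStatus());
    }

    // Analogue of IndexEventListener#beforeIndexShardClosed: invoked whenever
    // the shard is closed on this node, for whatever reason.
    void beforeIndexShardClosed(String shardId) {
        SnapshotStatus status = inFlight.remove(shardId);
        if (status != null) {
            status.abort();
        }
    }
}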
@@ -402,6 +402,10 @@ public String toString() {
    for (DiscoveryNode node : this) {
        sb.append(node).append(',');
    }
    if (sb.length() > 1) {
maybe simpler to replace
    for (DiscoveryNode node : this) { sb.append(node).append(','); }
with
    sb.append(Strings.collectionToDelimitedString(this, ","));
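The joined form also avoids the trailing delimiter that the manual loop leaves behind (presumably what the sb.length() > 1 check above is guarding against). A standalone illustration, using String.join as a stand-in for org.elasticsearch.common.Strings.collectionToDelimitedString:

import java.util.List;

class DelimiterDemo {
    public static void main(String[] args) {
        List<String> nodes = List.of("node-1", "node-2", "node-3");

        // Manual loop: a ',' is appended after every element, leaving a
        // trailing comma that the caller then has to trim off.
        StringBuilder sb = new StringBuilder("{");
        for (String node : nodes) {
            sb.append(node).append(',');
        }
        System.out.println(sb);                             // {node-1,node-2,node-3,

        // Delimited join: the ',' goes only between elements.
        System.out.println("{" + String.join(",", nodes));  // {node-1,node-2,node-3
    }
}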
done
@ywelsch the PR has been updated to use the beforeIndexShardClosed listener for aborting, and removed all the network disruption stuff
The change is good but the test needs a bit more work (I'm not sure it's testing the right thing).
String nodeWithPrimary = clusterState.nodes().get(indexRoutingTable.shard(0).primaryShard().currentNodeId()).getName();
assertNotNull("should be at least one node with a primary shard", nodeWithPrimary);
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, nodeWithPrimary);
indicesService.deleteIndex(resolveIndex(index), "trigger shard removal");
removeIndex might be good enough here.
    if (snapshotsInProgress != null && snapshotsInProgress.entries().size() > 0) {
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
    }
}, 10, TimeUnit.SECONDS);
assertBusy uses 10 seconds by default; no need to specify it here again
done
Settings.builder().put("number_of_shards", numPrimaries).put("number_of_replicas", numReplicas)));

logger.info("--> indexing some data");
Client client = client();
why not use a random client every time?
done
SnapshotsInProgress snapshotsInProgress =
    client.admin().cluster().prepareState().get().getState().custom(SnapshotsInProgress.TYPE);
if (snapshotsInProgress != null && snapshotsInProgress.entries().size() > 0) {
    assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
I think this will succeed even without the change in this PR? I'm not sure what is exactly tested here.
Without the change here, the snapshot stalls forever and the test times out, because the snapshot was never aborted. This asserts that we abort the snapshot, bringing the snapshotting to a successful conclusion.
@@ -2490,4 +2493,80 @@ public void testGetSnapshotsRequest() throws Exception {
    waitForCompletion(repositoryName, inProgressSnapshot, TimeValue.timeValueSeconds(60));
}

/**
 * This test ensures that if a node that holds a primary that is being snapshotted leaves the cluster,
 * when it returns, the node aborts the snapshotting on the now removed shard.
the description does not match what the test does.
fixed
logger.info("--> waiting for snapshot to be in progress on all nodes"); | ||
assertBusy(() -> { | ||
for (String node : internalCluster().nodesInclude(index)) { | ||
final Client nodeClient = client(node); |
why use this particular client?
I made a mistake here; this only ensures the snapshot cluster state update has reached the master, so I changed it to use internalCluster().clusterService(node).state() instead, to ensure each node knows that the snapshot is in progress.
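In test code, the revised check would presumably look something like this sketch against the internal test cluster API (a sketch, not the exact code from the PR; the surrounding test scaffolding is assumed):

assertBusy(() -> {
    for (String node : internalCluster().nodesInclude(index)) {
        // Read this node's local cluster state rather than going through a
        // client, so the assertion proves the node itself has seen the
        // snapshot start.
        SnapshotsInProgress snapshotsInProgress =
            internalCluster().clusterService(node).state().custom(SnapshotsInProgress.TYPE);
        assertNotNull("snapshot not yet in progress on " + node, snapshotsInProgress);
        assertFalse(snapshotsInProgress.entries().isEmpty());
    }
});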
    }
}, 10, TimeUnit.SECONDS);

// Pick a node with a primary shard and remove the shard from the node
Pick node with THE primary shard
done
LGTM. I left two small suggestions. Thanks @abeyad
logger.info("--> waiting for snapshot to complete"); | ||
waitForCompletion(repo, snapshot, TimeValue.timeValueSeconds(10)); | ||
|
||
// make sure snapshot is aborted and the aborted shard was marked as failed | ||
assertBusy(() -> { |
no assertBusy needed with waitForCompletion above?
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
    }
}, 10, TimeUnit.SECONDS);
List<SnapshotInfo> snapshotInfos = client().admin().cluster().prepareGetSnapshots(repo).setSnapshots(snapshot).get().getSnapshots();
waitForCompletion returns SnapshotInfo
5.x commit: 1d278d2
5.0 commit: 22ee78c
Previously, if a node left the cluster during a snapshot (for example, due to a long GC),
the master node would mark the snapshot as failed, but
the node itself could continue snapshotting the data on its shards to the
repository. If the node rejoins the cluster, the master may assign it to
hold the replica shard (where it held the primary before getting kicked off
the cluster). The initialization of the replica shard would repeatedly fail
with a ShardLockObtainFailedException until the snapshot thread finally
finishes and relinquishes the lock on the Store.
This commit resolves the situation by ensuring that when a shard is removed
from a node (such as when a node rejoins the cluster and realizes it no longer
holds the active shard copy), any snapshotting of the removed shards is aborted.
In the scenario above, when the node rejoins the cluster, it will see in the cluster
state that the node no longer holds the primary shard, so IndicesClusterStateService
will remove the shard, thereby causing any snapshots of that shard to be aborted.
Closes #20876
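For intuition, the abort is cooperative. A standalone sketch of the general pattern, with illustrative names only (not the actual Elasticsearch classes): the snapshot thread checks an abort flag between files, so an abort releases the shard lock promptly instead of only after the whole snapshot finishes.

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative names only; not the actual Elasticsearch classes.
final class ShardSnapshotTask {

    private final AtomicBoolean aborted = new AtomicBoolean(false);

    // Called when the shard is removed from the node, e.g. after a rejoin
    // once the node sees it no longer holds the primary.
    void abort() {
        aborted.set(true);
    }

    // The snapshot thread checks the flag between files, so an abort is
    // observed quickly rather than after every file has been uploaded.
    void run(List<String> segmentFiles, AutoCloseable shardLock) throws Exception {
        try {
            for (String file : segmentFiles) {
                if (aborted.get()) {
                    throw new IllegalStateException("snapshot aborted");
                }
                copyToRepository(file);
            }
        } finally {
            shardLock.close(); // frees the store so a replica can be allocated
        }
    }

    private void copyToRepository(String file) {
        // repository upload elided from this sketch
    }
}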