Abort snapshots on a node that leaves the cluster #21084
Conversation
Previously, if a node left the cluster during a snapshot (for example, due to a long GC), the master node would mark the snapshot as failed, but the node itself could continue snapshotting the data on its shards to the repository. If the node rejoins the cluster, the master may assign it to hold the replica shard (where it held the primary before getting kicked off the cluster). The initialization of the replica shard would repeatedly fail with a ShardLockObtainFailedException until the snapshot thread finally finishes and relinquishes the lock on the Store. This commit resolves the situation by ensuring that the shard snapshot is aborted when the node responsible for that shard's snapshot leaves the cluster. When the node rejoins the cluster, it will see in the cluster state that the snapshot for that shard has failed and abort the snapshot locally, allowing the shard data directory to be freed for allocation of a replica shard on the same node. Closes elastic#20876
I wonder if the better change would be to treat aborting snapshots in the same way as we abort outgoing peer recoveries of a primary: by registering an IndexEventListener that listens for beforeIndexShardClosed and cancels recoveries / aborts ongoing snapshots at that time. This ensures that snapshots are aborted whenever we close the shard, simplifying the logic here. WDYT?
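For illustration, a minimal standalone sketch of that pattern, with stand-in names rather than the actual Elasticsearch types: a per-shard registry of in-flight snapshot statuses whose entries are aborted from a shard-closed callback, mirroring how outgoing peer recoveries are cancelled.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in names only; not the actual Elasticsearch types.
final class ShardSnapshotRegistry {

    enum Stage { STARTED, ABORTED }

    // One mutable status per in-flight shard snapshot; the snapshot
    // thread polls it and bails out once it flips to ABORTED.
    static final class SnapshotStatus {
        private volatile Stage stage = Stage.STARTED;
        void abort() { stage = Stage.ABORTED; }
        boolean isAborted() { return stage == Stage.ABORTED; }
    }

    private final Map<String, SnapshotStatus> inFlight = new ConcurrentHashMap<>();

    // Registered when a shard snapshot begins.
    SnapshotStatus onSnapshotStarted(String shardId) {
        return inFlight.computeIfAbsent(shardId, id -> new SnapshotStatus());
    }

    // Analogue of IndexEventListener#beforeIndexShardClosed: invoked whenever
    // the shard is closed on this node, for whatever reason.
    void beforeIndexShardClosed(String shardId) {
        SnapshotStatus status = inFlight.remove(shardId);
        if (status != null) {
            status.abort();
        }
    }
}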
@@ -402,6 +402,10 @@ public String toString() {
    for (DiscoveryNode node : this) {
        sb.append(node).append(',');
    }
    if (sb.length() > 1) {
maybe simpler to replace
    for (DiscoveryNode node : this) { sb.append(node).append(','); }
with
    sb.append(Strings.collectionToDelimitedString(this, ","));
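The joined form also avoids the trailing delimiter that the manual loop leaves behind (presumably what the sb.length() > 1 check above is guarding against). A standalone illustration, using String.join as a stand-in for org.elasticsearch.common.Strings.collectionToDelimitedString:

import java.util.List;

class DelimiterDemo {
    public static void main(String[] args) {
        List<String> nodes = List.of("node-1", "node-2", "node-3");

        // Manual loop: a ',' is appended after every element, leaving a
        // trailing comma that the caller then has to trim off.
        StringBuilder sb = new StringBuilder("{");
        for (String node : nodes) {
            sb.append(node).append(',');
        }
        System.out.println(sb);                             // {node-1,node-2,node-3,

        // Delimited join: the ',' goes only between elements.
        System.out.println("{" + String.join(",", nodes));  // {node-1,node-2,node-3
    }
}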
done
@ywelsch the PR has been updated to use the beforeIndexShardClosed listener for aborting, and removed all the network disruption stuff
The change is good but the test needs a bit more work (I'm not sure it's testing the right thing).
String nodeWithPrimary = clusterState.nodes().get(indexRoutingTable.shard(0).primaryShard().currentNodeId()).getName();
assertNotNull("should be at least one node with a primary shard", nodeWithPrimary);
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, nodeWithPrimary);
indicesService.deleteIndex(resolveIndex(index), "trigger shard removal");
removeIndex might be good enough here.
    if (snapshotsInProgress != null && snapshotsInProgress.entries().size() > 0) {
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
    }
}, 10, TimeUnit.SECONDS);
assertBusy uses 10 seconds by default; no need to specify it here again
done
Settings.builder().put("number_of_shards", numPrimaries).put("number_of_replicas", numReplicas)));

logger.info("--> indexing some data");
Client client = client();
why not use a random client every time?
done
SnapshotsInProgress snapshotsInProgress =
    client.admin().cluster().prepareState().get().getState().custom(SnapshotsInProgress.TYPE);
if (snapshotsInProgress != null && snapshotsInProgress.entries().size() > 0) {
    assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
I think this will succeed even without the change in this PR? I'm not sure what is exactly tested here.
Without the change here, the snapshot stalls forever and the test times out, because the snapshot was never aborted. This asserts that we abort the snapshot, bringing the snapshotting to a successful conclusion.
@@ -2490,4 +2493,80 @@ public void testGetSnapshotsRequest() throws Exception {
    waitForCompletion(repositoryName, inProgressSnapshot, TimeValue.timeValueSeconds(60));
}

/**
 * This test ensures that if a node that holds a primary that is being snapshotted leaves the cluster,
 * when it returns, the node aborts the snapshotting on the now removed shard.
the description does not match what the test does.
fixed
logger.info("--> waiting for snapshot to be in progress on all nodes"); | ||
assertBusy(() -> { | ||
for (String node : internalCluster().nodesInclude(index)) { | ||
final Client nodeClient = client(node); |
why use this particular client?
I made a mistake here; this only ensures the snapshot cluster state update has reached the master, so I changed it to use internalCluster().clusterService(node).state() instead, to ensure each node knows that the snapshot is in progress.
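In test code, the revised check would presumably look something like this sketch against the internal test cluster API (a sketch, not the exact code from the PR; the surrounding test scaffolding is assumed):

assertBusy(() -> {
    for (String node : internalCluster().nodesInclude(index)) {
        // Read this node's local cluster state rather than going through a
        // client, so the assertion proves the node itself has seen the
        // snapshot start.
        SnapshotsInProgress snapshotsInProgress =
            internalCluster().clusterService(node).state().custom(SnapshotsInProgress.TYPE);
        assertNotNull("snapshot not yet in progress on " + node, snapshotsInProgress);
        assertFalse(snapshotsInProgress.entries().isEmpty());
    }
});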
    }
}, 10, TimeUnit.SECONDS);

// Pick a node with a primary shard and remove the shard from the node
Pick node with THE primary shard
done
LGTM. I left two small suggestions. Thanks @abeyad
logger.info("--> waiting for snapshot to complete"); | ||
waitForCompletion(repo, snapshot, TimeValue.timeValueSeconds(10)); | ||
|
||
// make sure snapshot is aborted and the aborted shard was marked as failed | ||
assertBusy(() -> { |
no assertBusy needed with waitForCompletion above?
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
    }
}, 10, TimeUnit.SECONDS);
List<SnapshotInfo> snapshotInfos = client().admin().cluster().prepareGetSnapshots(repo).setSnapshots(snapshot).get().getSnapshots();
waitForCompletion returns SnapshotInfo
5.x commit: 1d278d2
5.0 commit: 22ee78c
Previously, if a node left the cluster during a snapshot (for example, due to a long GC),
the master node would mark the snapshot as failed, but
the node itself could continue snapshotting the data on its shards to the
repository. If the node rejoins the cluster, the master may assign it to
hold the replica shard (where it held the primary before getting kicked off
the cluster). The initialization of the replica shard would repeatedly fail
with a ShardLockObtainFailedException until the snapshot thread finally
finishes and relinquishes the lock on the Store.
This commit resolves the situation by ensuring that when a shard is removed
from a node (such as when a node rejoins the cluster and realizes it no longer
holds the active shard copy), any snapshotting of the removed shards is aborted.
In the scenario above, when the node rejoins the cluster, it will see in the cluster
state that the node no longer holds the primary shard, so IndicesClusterStateService
will remove the shard, thereby causing any snapshots of that shard to be aborted.
Closes #20876
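For intuition, the abort is cooperative. A standalone sketch of the general pattern, with illustrative names only (not the actual Elasticsearch classes): the snapshot thread checks an abort flag between files, so an abort releases the shard lock promptly instead of only after the whole snapshot finishes.

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative names only; not the actual Elasticsearch classes.
final class ShardSnapshotTask {

    private final AtomicBoolean aborted = new AtomicBoolean(false);

    // Called when the shard is removed from the node, e.g. after a rejoin
    // once the node sees it no longer holds the primary.
    void abort() {
        aborted.set(true);
    }

    // The snapshot thread checks the flag between files, so an abort is
    // observed quickly rather than after every file has been uploaded.
    void run(List<String> segmentFiles, AutoCloseable shardLock) throws Exception {
        try {
            for (String file : segmentFiles) {
                if (aborted.get()) {
                    throw new IllegalStateException("snapshot aborted");
                }
                copyToRepository(file);
            }
        } finally {
            shardLock.close(); // frees the store so a replica can be allocated
        }
    }

    private void copyToRepository(String file) {
        // repository upload elided from this sketch
    }
}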