Fix Snapshot State Machine Issues around Failed Clones #76419
Conversation
With recent fixes it is never correct to simply remove a snapshot from the cluster state without updating other snapshot entries if the removed entry contains any successful shards, because of possible dependencies between entries. This change reproduces two issues that resulted from removing a snapshot without regard for other queued operations, and fixes them by routing all removal of snapshots from the cluster state through the same code path. It also moves the tracking of a snapshot as "ending" up a few lines, to fix an assertion about finishing snapshots that requires them to be in that collection.
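The "single removal path" idea can be illustrated with a small sketch. This is not the actual SnapshotsService code: the Entry record, its fields, and the removeFailedEntry helper below are simplified, hypothetical stand-ins, showing only that a removal must also fix up the remaining queued entries rather than just dropping the failed one.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class SnapshotRemovalSketch {

        // Hypothetical, heavily simplified stand-in for an in-progress snapshot entry.
        record Entry(String name, Set<String> completedShards, Set<String> queuedBehind) {}

        // Single removal path: drop the failed entry AND fix up every remaining entry,
        // so nothing is left waiting on shards owned by an entry that no longer exists.
        static List<Entry> removeFailedEntry(List<Entry> current, String failedName) {
            Entry failed = current.stream()
                .filter(e -> e.name().equals(failedName))
                .findFirst()
                .orElse(null);
            if (failed == null) {
                return current; // idempotent: nothing to do if the entry is already gone
            }
            List<Entry> updated = new ArrayList<>();
            for (Entry entry : current) {
                if (entry == failed) {
                    continue;
                }
                // Shards the failed entry had already completed can now be handed to
                // entries that were queued behind it, instead of leaving them waiting
                // on an entry that no longer exists.
                Set<String> stillQueued = new HashSet<>(entry.queuedBehind());
                stillQueued.removeAll(failed.completedShards());
                updated.add(new Entry(entry.name(), entry.completedShards(), stillQueued));
            }
            return updated;
        }
    }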
Pinging @elastic/es-distributed (Team:Distributed)
  if (entry.isClone() && entry.state() == State.FAILED) {
      logger.debug("Removing failed snapshot clone [{}] from cluster state", entry);
-     removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
+     if (newFinalization) {
+         removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
We could do a possible follow-up here to remove the data from partially failed clones from the repo. Since clones only really fail like this due to IO exceptions, and IO exceptions are very unlikely unless something is seriously broken, I'm not sure it's worth the effort: the cleanup would probably fail as well, and it would happen on a subsequent delete run anyway. Since we aren't writing a new index-N (so this isn't really relevant to the scalability of the repo) and clones by their very nature add almost no bytes to the repo, I think this is good enough for now.
        blockOnWriteShardLevelMeta = true;
    }

    public void setBlockAndFailOnWriteShardLevelMeta() {
Not great to add yet another method to this, but this case turns out to be important I guess. Reproducing this like we used to via blocking and then restarting master simply did not cover this obviously broken spot.
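For readers unfamiliar with this kind of test hook, here is a minimal, hypothetical sketch of what a fail-on-write flag in a test repository could look like. Apart from setBlockAndFailOnWriteShardLevelMeta, which appears in the diff above, the class and method names are invented for illustration, and the sketch fails immediately rather than blocking first.

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical test repository: one flag forces the next shard-level metadata write
    // to fail deterministically, so the failed-clone code path can be exercised directly.
    final class FailingTestRepository {

        private final AtomicBoolean failOnWriteShardLevelMeta = new AtomicBoolean();

        // Tests flip this flag right before triggering the clone/snapshot they want to fail.
        public void setBlockAndFailOnWriteShardLevelMeta() {
            failOnWriteShardLevelMeta.set(true);
        }

        // Called from the (hypothetical) shard-level metadata write path.
        void writeShardLevelMetadata(String shardId, byte[] bytes) throws IOException {
            if (failOnWriteShardLevelMeta.getAndSet(false)) {
                throw new IOException("simulated failure writing shard-level metadata for " + shardId);
            }
            // normal write would happen here
        }
    }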
LGTM but a question about test coverage
  if (entry.isClone() && entry.state() == State.FAILED) {
      logger.debug("Removing failed snapshot clone [{}] from cluster state", entry);
-     removeFailedSnapshotFromClusterState(snapshot, new SnapshotException(snapshot, entry.failure()), null);
+     if (newFinalization) {
Do we have a test for this new condition? I deleted it and ran a few likely-looking suites and also all of :server:ictest, but didn't see any failures.
We do not, and I'm not sure how I could deterministically reproduce this at the moment. You can technically get here in scenarios where the master fails over twice in a row, but that would also need to happen concurrently with finalizing a snapshot queued before the clone ... I'll think about a way to set this up. For now I added this to be defensive and avoid pointless CS updates (technically these updates should be idempotent anyway, because they become no-ops as soon as the snapshot to be removed is no longer in the CS).
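The idempotency argument can be made concrete with a tiny sketch (illustrative only; State and removeSnapshot below are invented names, not Elasticsearch classes): a removal that first checks for the entry's presence degenerates into a no-op on any repeated application, so a duplicate update is merely wasteful, never harmful.

    import java.util.List;
    import java.util.stream.Collectors;

    final class IdempotentRemovalSketch {

        // Hypothetical, minimal stand-in for the snapshots-in-progress part of the cluster state.
        record State(List<String> snapshotsInProgress) {}

        // Applying this twice is harmless: the second application finds nothing to remove
        // and returns the state unchanged.
        static State removeSnapshot(State state, String snapshot) {
            if (!state.snapshotsInProgress().contains(snapshot)) {
                return state; // no-op: the snapshot was already removed by an earlier update
            }
            return new State(state.snapshotsInProgress().stream()
                .filter(s -> !s.equals(snapshot))
                .collect(Collectors.toList()));
        }
    }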
Thanks David!