
Fix Concurrent Snapshot Repository Corruption from Operations Queued after Failing Operations #75733

Conversation

original-brownbear
Member

The node executing a shard-level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

This change fixes the logic on the side of the node executing the shard-level operation and reproduces the issue in one case. The exact case in the linked issue is unfortunately very hard to reproduce, because it requires very specific timing that our test infra currently does not easily enable. This should also be fixed in the master node logic (both to make the code more obviously correct and to better handle mixed-version clusters), but I'd like to delay that and do it in #75501 because of how tricky it would be (lots of confusing code for clones, snapshots, and so on) to figure out the correct generation from the cluster state without that refactoring.

closes #75598
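
[Editor's note] A minimal sketch of the corrected failure path, reconstructed from the diff excerpt discussed in the review below; the surrounding variable names and context are assumptions, not the exact code of this PR:

    // Sketch (assumed context): when a shard-level operation fails, report the
    // last known shard generation instead of null. A null generation makes the
    // master treat the shard snapshot directory as empty, so queued follow-up
    // operations would start from scratch and corrupt the repository.
    final SnapshotsInProgress.ShardSnapshotStatus failedStatus =
        new SnapshotsInProgress.ShardSnapshotStatus(
            localNodeId,
            SnapshotsInProgress.ShardState.FAILED,
            "failed to clone shard snapshot",
            shardStatusBefore.generation() // was null before this change
        );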

…after Failing Operations

The node executing a shard level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes elastic#75598
elasticmachine added the Team:Distributed (Obsolete) label on Jul 27, 2021
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@@ -256,6 +256,10 @@ public static void blockDataNode(String repository, String nodeName) {
AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockOnDataFiles();
}

public static void blockAndFailDataNode(String repository, String nodeName) {
original-brownbear (Member, Author)

The fact that we didn't have this logic revealed an unfortunate lack of test coverage. We have a number of tests that simulate data-node failure, but they're all based on blocking the data node via the existing block-and-wait helpers and then shutting down the blocked data nodes, which triggers a very different code path on master.
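
[Editor's note] A plausible body for the new helper, mirroring the blockDataNode pattern shown in the hunk above; blockAndFailOnDataFiles() is assumed to be the matching MockRepository hook and is not confirmed by this excerpt:

    public static void blockAndFailDataNode(String repository, String nodeName) {
        // Same repository lookup as blockDataNode above, but instructs the mock
        // repository to fail the blocked operation rather than merely wait.
        AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName)
            .blockAndFailOnDataFiles(); // assumed hook, mirroring blockOnDataFiles()
    }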

DaveCTurner (Contributor) left a comment

Nice catch. I asked for some slightly tighter assertions, if possible.

localNodeId,
ShardState.FAILED,
"failed to clone shard snapshot",
shardStatusBefore.generation()
DaveCTurner (Contributor)
Can we shift some of the @Nullable annotations and associated assertions around in ShardSnapshotStatus? With this change I think we might still receive a null generation over the wire from an older version, but we shouldn't be creating them afresh any more?

DaveCTurner (Contributor)

In fact we ought to change the wire format so it's no longer an OptionalString. I'm OK with not doing that here, since it would make the backport that much harder; a follow-up is fine.
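
[Editor's note] A rough, hypothetical illustration of the follow-up David describes; the cutover version and field wiring are assumptions, not code from the PR:

    // Hypothetical sketch: drop the optional encoding once every supported
    // node is guaranteed to send a non-null generation.
    @Override
    public void writeTo(StreamOutput out) throws IOException {
        if (out.getVersion().onOrAfter(Version.V_8_0_0)) { // assumed cutover version
            out.writeString(generation);         // generation guaranteed non-null
        } else {
            out.writeOptionalString(generation); // legacy optional wire format
        }
    }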

original-brownbear (Member, Author)

We would in fact still create them for pre-7.6 state machine operations (still a thing if there's an old snapshot in your repo), and we don't have access to the snapshot version in these constructors. In these cases `null` means "figure out the numeric generation yourself", which is what would happen to a queued operation if e.g. the first operation for a shard in the CS fails.

Let me see what I can do about this though :)
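
[Editor's note] To make the pre-7.6 semantics above concrete, a hedged sketch of what a null generation implies on the receiving side; findLatestNumericGeneration is a made-up helper name, not the actual method:

    // Hypothetical illustration of the legacy semantics: a null generation
    // tells the data node to derive the numeric generation itself.
    String generation = shardStatus.generation();
    if (generation == null) {
        // e.g. scan the shard container for the highest numeric snap- generation
        generation = String.valueOf(findLatestNumericGeneration(shardContainer)); // hypothetical helper
    }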

original-brownbear (Member, Author) commented on Jul 27, 2021

Bad news here, I'm afraid. I had to remove the assertion since it didn't hold up, and it's also really hard to assert this elsewhere due to the BwC situation (we can't neatly do it in ShardSnapshotStatus without refactoring its construction, and doing it elsewhere is tricky as well since it's so many places right now).
If it's OK with you, I'd rather look for a cleaner way of asserting this once #75501 has landed (or actually as part of incorporating this into that change) and just assert that we no longer make any illegal changes to SnapshotsInProgress like this, where a non-null generation becomes a null generation for a given shard (much easier if we don't have to hack around translating ShardId and RepoShardId all over the place)?
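
[Editor's note] The invariant described here could eventually be asserted roughly as follows; this is a sketch of the idea, not code from the PR:

    // Sketch: when applying an update to SnapshotsInProgress, a shard's
    // generation must never go from non-null back to null.
    assert previousStatus.generation() == null || updatedStatus.generation() != null
        : "generation for " + shardId + " illegally reset from ["
            + previousStatus.generation() + "] to null";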

DaveCTurner (Contributor)
Sure, thanks for looking. Blasted BwC, always spoiling the fun.

DaveCTurner (Contributor) left a comment

LGTM

original-brownbear (Member, Author)

Thanks David!

original-brownbear merged commit f1ba7c4 into elastic:master on Jul 27, 2021
original-brownbear deleted the fix-failed-shard-snapshot-queuing branch on July 27, 2021 14:13
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Jul 30, 2021
…after Failing Operations (elastic#75733)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 16, 2021
…after Failing Operations (elastic#75733)

original-brownbear added a commit that referenced this pull request Aug 16, 2021
…after Failing Operations (#75733) (#76548)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 16, 2021
…after Failing Operations (elastic#75733) (elastic#76548)

original-brownbear added a commit that referenced this pull request Aug 16, 2021
…after Failing Operations (#75733) (#76548) (#76556)

original-brownbear restored the fix-failed-shard-snapshot-queuing branch on April 18, 2023 21:01
Labels
>bug, :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), Team:Distributed (Obsolete), v7.14.1, v7.15.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Aborting a Snapshot Queued after a Finalizing Snapshot is Broken
6 participants