
Fix Concurrent Snapshot Repository Corruption from Operations Queued after Failing Operations #75733

Conversation

original-brownbear
Member

The node executing a shard-level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

This change fixes the logic on the side of the node executing the shard-level operation and reproduces the issue in one case. The exact case in the linked issue is unfortunately very hard to reproduce, because it requires very specific timing that our test infra currently does not easily enable. This should also be fixed in the master node logic (both to make the code more obviously correct and to better handle mixed-version clusters), but I'd like to delay that and do it in #75501 because of how tricky it would be (lots of confusing code for clones, snapshots, and so on) to figure out the correct generation from the cluster state without that refactoring.

closes #75598
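
[Editor's note] A minimal sketch of the corrected failure path, reconstructed from the diff excerpt discussed in the review below; the surrounding variable names and context are assumptions, not the exact code of this PR:

    // Sketch (assumed context): when a shard-level operation fails, report the
    // last known shard generation instead of null. A null generation makes the
    // master treat the shard snapshot directory as empty, so queued follow-up
    // operations would start from scratch and corrupt the repository.
    final SnapshotsInProgress.ShardSnapshotStatus failedStatus =
        new SnapshotsInProgress.ShardSnapshotStatus(
            localNodeId,
            SnapshotsInProgress.ShardState.FAILED,
            "failed to clone shard snapshot",
            shardStatusBefore.generation() // was null before this change
        );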

…after Failing Operations

The node executing a shard level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes elastic#75598
elasticmachine added the Team:Distributed (Obsolete) label on Jul 27, 2021
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@@ -256,6 +256,10 @@ public static void blockDataNode(String repository, String nodeName) {
AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockOnDataFiles();
}

public static void blockAndFailDataNode(String repository, String nodeName) {
original-brownbear (Member, Author)

The fact that we didn't have this logic revealed an unfortunate lack of test coverage. We have a number of tests that simulate data-node failure, but they're all based on blocking the data node via the existing block-and-wait helpers and then shutting down the blocked data nodes, which triggers a very different code path on master.
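
[Editor's note] A plausible body for the new helper, mirroring the blockDataNode pattern shown in the hunk above; blockAndFailOnDataFiles() is assumed to be the matching MockRepository hook and is not confirmed by this excerpt:

    public static void blockAndFailDataNode(String repository, String nodeName) {
        // Same repository lookup as blockDataNode above, but instructs the mock
        // repository to fail the blocked operation rather than merely wait.
        AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName)
            .blockAndFailOnDataFiles(); // assumed hook, mirroring blockOnDataFiles()
    }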

DaveCTurner (Contributor) left a comment

Nice catch. I asked for some slightly tighter assertions, if possible.

localNodeId,
ShardState.FAILED,
"failed to clone shard snapshot",
shardStatusBefore.generation()
DaveCTurner (Contributor)
Can we shift some of the @Nullable annotations and associated assertions around in ShardSnapshotStatus? With this change I think we might still receive a null generation over the wire from an older version, but we shouldn't be creating them afresh any more?

DaveCTurner (Contributor)

In fact we ought to change the wire format so it's no longer an OptionalString. I'm OK with not doing that here, since it would make the backport that much harder; a follow-up is fine.
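
[Editor's note] A rough, hypothetical illustration of the follow-up David describes; the cutover version and field wiring are assumptions, not code from the PR:

    // Hypothetical sketch: drop the optional encoding once every supported
    // node is guaranteed to send a non-null generation.
    @Override
    public void writeTo(StreamOutput out) throws IOException {
        if (out.getVersion().onOrAfter(Version.V_8_0_0)) { // assumed cutover version
            out.writeString(generation);         // generation guaranteed non-null
        } else {
            out.writeOptionalString(generation); // legacy optional wire format
        }
    }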

original-brownbear (Member, Author)

We would in fact still create them for pre-7.6 state machine operations (still a thing if there's an old snapshot in your repo), and we don't have access to the snapshot version in these constructors. In these cases `null` means "figure out the numeric generation yourself", which is what would happen to a queued operation if e.g. the first operation for a shard in the CS fails.

Let me see what I can do about this though :)
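
[Editor's note] To make the pre-7.6 semantics above concrete, a hedged sketch of what a null generation implies on the receiving side; findLatestNumericGeneration is a made-up helper name, not the actual method:

    // Hypothetical illustration of the legacy semantics: a null generation
    // tells the data node to derive the numeric generation itself.
    String generation = shardStatus.generation();
    if (generation == null) {
        // e.g. scan the shard container for the highest numeric snap- generation
        generation = String.valueOf(findLatestNumericGeneration(shardContainer)); // hypothetical helper
    }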

original-brownbear (Member, Author) commented on Jul 27, 2021

Bad news here, I'm afraid. I had to remove the assertion since it didn't hold up, and it's also really hard to assert this elsewhere due to the BwC situation (we can't neatly do it in ShardSnapshotStatus without refactoring its construction, and doing it elsewhere is tricky as well since it's so many places right now).
If it's OK with you, I'd rather look for a cleaner way of asserting this once #75501 has landed (or actually as part of incorporating this into that change) and just assert that we no longer make any illegal changes to SnapshotsInProgress like this, where a non-null generation becomes a null generation for a given shard (much easier if we don't have to hack around translating ShardId and RepoShardId all over the place)?
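
[Editor's note] The invariant described here could eventually be asserted roughly as follows; this is a sketch of the idea, not code from the PR:

    // Sketch: when applying an update to SnapshotsInProgress, a shard's
    // generation must never go from non-null back to null.
    assert previousStatus.generation() == null || updatedStatus.generation() != null
        : "generation for " + shardId + " illegally reset from ["
            + previousStatus.generation() + "] to null";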

DaveCTurner (Contributor)
Sure, thanks for looking. Blasted BwC, always spoiling the fun.

DaveCTurner (Contributor) left a comment

LGTM

original-brownbear (Member, Author)

Thanks David!

original-brownbear merged commit f1ba7c4 into elastic:master on Jul 27, 2021
original-brownbear deleted the fix-failed-shard-snapshot-queuing branch on July 27, 2021 14:13
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Jul 30, 2021
…after Failing Operations (elastic#75733)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 16, 2021
…after Failing Operations (elastic#75733)

original-brownbear added a commit that referenced this pull request Aug 16, 2021
…after Failing Operations (#75733) (#76548)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 16, 2021
…after Failing Operations (elastic#75733) (elastic#76548)

original-brownbear added a commit that referenced this pull request Aug 16, 2021
…after Failing Operations (#75733) (#76548) (#76556)

original-brownbear restored the fix-failed-shard-snapshot-queuing branch on April 18, 2023 21:01
Labels
>bug, :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), Team:Distributed (Obsolete), v7.14.1, v7.15.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Aborting a Snapshot Queued after a Finalizing Snapshot is Broken
6 participants