Fix Concurrent Snapshot Repository Corruption from Operations Queued after Failing Operations #75733
Conversation
…after Failing Operations

The node executing a shard level operation would in many cases communicate `null` for the shard state update, leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes elastic#75598
Pinging @elastic/es-distributed (Team:Distributed)
@@ -256,6 +256,10 @@ public static void blockDataNode(String repository, String nodeName) {
        AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockOnDataFiles();
    }

    public static void blockAndFailDataNode(String repository, String nodeName) {
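For context, a plausible body for this new helper, mirroring the existing `blockDataNode` shown above. Note that the `blockAndFailOnDataFiles()` method on `MockRepository` is an assumption inferred from the pattern and is not visible in this excerpt:

```java
// Sketch only: mirrors blockDataNode above. blockAndFailOnDataFiles() on
// MockRepository is an assumed counterpart to blockOnDataFiles() and is not
// shown in this diff excerpt.
public static void blockAndFailDataNode(String repository, String nodeName) {
    AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockAndFailOnDataFiles();
}
```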
The fact that we didn't have this logic revealed an unfortunate lack of test coverage. We have a number of tests that simulate data-node failure, but they're all based on blocking the data node via the existing block-and-wait and then shutting down the blocked data nodes, which triggers a very different code path on master.
Nice catch, I asked for some slightly tighter assertions if possible.
localNodeId,
ShardState.FAILED,
"failed to clone shard snapshot",
shardStatusBefore.generation()
Can we shift some of the `@Nullable` annotations and associated assertions around in `ShardSnapshotStatus`? With this change I think we might still receive a null generation over the wire from an older version, but we shouldn't be creating them afresh any more?
In fact we ought to change the wire format so it's no longer an `OptionalString`. I'm ok with not doing that here since it would make the backport that much harder; a follow-up is fine.
We would in fact still create them for pre-7.6 state machine operations (still a thing if there's an old snapshot in your repo), and we don't have access to the snapshot version in these constructors. In these cases `null` means "figure out the numeric generation yourself", which is what would happen to a queued operation if e.g. the first operation for a shard in the CS fails.
Let me see what I can do about this though :)
Bad news here I'm afraid. I had to remove the assertion since it didn't hold up, and it's also really hard to assert this elsewhere due to the BwC situation (we can't neatly do this in `ShardSnapshotStatus` without refactoring its construction, and doing it elsewhere is tricky as well since it happens in so many places right now).
If it's ok with you, I'd rather look for a cleaner way of asserting this once #75501 has landed (or actually as part of incorporating this into that change) and just assert that we no longer make any illegal changes to `SnapshotsInProgress` like this, where a non-null generation becomes a `null` generation for a given shard (much easier if we don't have to hack around translating `ShardId` and `RepoShardId` all over the place)?
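To make the invariant being discussed concrete, here is a simplified, self-contained sketch of that kind of assertion. The `Map`-based representation and the helper name are stand-ins for illustration only; the real check would operate on `SnapshotsInProgress` inside Elasticsearch:

```java
import java.util.Map;

// Simplified stand-in for the invariant discussed above: once a shard has a
// known (non-null) generation, a later cluster-state update must not replace
// it with null. The String-keyed map is illustrative only; the real state
// lives in SnapshotsInProgress.
final class ShardGenerationInvariant {

    static boolean assertNoGenerationLost(Map<String, String> before, Map<String, String> after) {
        for (Map.Entry<String, String> entry : before.entrySet()) {
            final String previousGeneration = entry.getValue();
            if (previousGeneration != null
                    && after.containsKey(entry.getKey())
                    && after.get(entry.getKey()) == null) {
                throw new AssertionError(
                        "shard [" + entry.getKey() + "] lost its generation [" + previousGeneration + "]");
            }
        }
        return true;
    }
}
```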
Sure, thanks for looking. Blasted BwC, always spoiling the fun.
LGTM
Thanks David!
…after Failing Operations (elastic#75733)

The node executing a shard level operation would in many cases communicate `null` for the shard state update, leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes elastic#75598
…after Failing Operations (elastic#75733) (elastic#76548)

The node executing a shard level operation would in many cases communicate `null` for the shard state update, leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes elastic#75598
The node executing a shard level operation would in many cases communicate `null` for the shard state update, leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

This change fixes the logic on the side of the node executing the shard level operation and reproduces the issue in one case. The exact case in the linked issue is unfortunately very hard to reproduce because it requires very specific timing that our test infra does not currently enable easily. This should also be fixed in the master node logic (to make the code more obviously correct and to better handle mixed-version clusters), but I'd like to delay that and do it in #75501 because of how tricky it would be (as in lots of confusing code for clones and snapshots and so on) to figure out the correct generation from the cluster state without that refactoring.
closes #75598
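To illustrate the failure mode in isolation, here is a simplified, self-contained sketch (not Elasticsearch's actual classes; `ShardSnapshotUpdate`, `resolveStartingGeneration`, and the string-based generation are stand-ins). A failure update that drops the known generation makes the next queued operation start from scratch, while one that carries the previous generation lets it resume on top of the existing shard snapshot directory:

```java
// Simplified stand-in types; the real logic lives in Elasticsearch's
// SnapshotsInProgress / snapshot shard services.
record ShardSnapshotUpdate(String nodeId, String state, String reason, String generation) {}

final class GenerationDemo {

    // A queued follow-up operation picks its starting shard generation from the
    // previous update; null is interpreted as "no existing shard snapshot data",
    // i.e. start from an empty shard snapshot directory.
    static String resolveStartingGeneration(ShardSnapshotUpdate previousUpdate) {
        return previousUpdate.generation() == null ? "_empty_" : previousUpdate.generation();
    }

    public static void main(String[] args) {
        final String knownGeneration = "gen-42"; // generation written by an earlier successful snapshot

        // Buggy behaviour described in the PR: the failing node reports null,
        // so the queued operation wrongly assumes an empty directory.
        ShardSnapshotUpdate buggyFailure =
                new ShardSnapshotUpdate("node-1", "FAILED", "failed to clone shard snapshot", null);
        System.out.println(resolveStartingGeneration(buggyFailure));  // _empty_ -> starts from scratch

        // Fixed behaviour: the failure update carries the previous generation,
        // so the queued operation resumes from the existing shard snapshot data.
        ShardSnapshotUpdate fixedFailure =
                new ShardSnapshotUpdate("node-1", "FAILED", "failed to clone shard snapshot", knownGeneration);
        System.out.println(resolveStartingGeneration(fixedFailure));  // gen-42 -> reuses existing files
    }
}
```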