Fix Concurrent Snapshot Repository Corruption from Operations Queued after Failing Operations #75733
@@ -256,6 +256,10 @@ public static void blockDataNode(String repository, String nodeName) {
        AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockOnDataFiles();
    }

    public static void blockAndFailDataNode(String repository, String nodeName) {
The fact that we didn't have this logic revealed an unfortunate lack of test coverage. We have a number of tests that simulate data-node failure, but they're all based on blocking the data node via the existing block-and-wait and then shutting down the blocked data nodes, which triggers a very different code path on master.
        AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockAndFailOnDataFiles();
    }

    public static void blockAllDataNodes(String repository) {
        for (RepositoriesService repositoriesService : internalCluster().getDataNodeInstances(RepositoriesService.class)) {
            ((MockRepository) repositoriesService.repository(repository)).blockOnDataFiles();
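
The difference between block-and-wait and block-and-fail described in the review comment on `blockAndFailDataNode` can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `BlockingStore`, `runScenario`, and the exception message are stand-ins invented for illustration and are not the real `MockRepository` API. The sketch assumes that a blocked write either resumes normally once unblocked (block-and-wait) or throws once unblocked (block-and-fail).

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a mock repository; not the real MockRepository.
final class BlockingStore {
    private volatile CountDownLatch latch;      // non-null while writes are blocked
    private volatile boolean failAfterUnblock;  // true only in block-and-fail mode

    /** Block writes; they complete normally once unblocked. */
    void block() {
        failAfterUnblock = false;
        latch = new CountDownLatch(1);
    }

    /** Block writes; they throw once unblocked. */
    void blockAndFail() {
        failAfterUnblock = true;
        latch = new CountDownLatch(1);
    }

    /** Release any writer that is currently waiting. */
    void unblock() {
        CountDownLatch current = latch;
        latch = null;
        if (current != null) {
            current.countDown();
        }
    }

    void write(String blobName) throws IOException, InterruptedException {
        CountDownLatch current = latch;
        if (current != null) {
            current.await();
            if (failAfterUnblock) {
                throw new IOException("simulated failure after block for [" + blobName + "]");
            }
        }
        // A real store would write the blob here.
    }

    /** Blocks a writer thread, releases it, and reports how the write ended. */
    static String runScenario(boolean fail) {
        BlockingStore store = new BlockingStore();
        if (fail) {
            store.blockAndFail();
        } else {
            store.block();
        }
        String[] outcome = { "completed" };
        Thread writer = new Thread(() -> {
            try {
                store.write("data-blob");
            } catch (IOException | InterruptedException e) {
                outcome[0] = "failed";
            }
        });
        writer.start();
        try {
            // Wait until the writer is parked on the latch before releasing it.
            while (writer.getState() != Thread.State.WAITING) {
                Thread.sleep(1);
            }
            store.unblock();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        return outcome[0];
    }
}
```

In this sketch the node stays up in both modes. Only the block-and-fail mode makes the write itself fail, which is the kind of data-node failure the comment says the existing shutdown-based tests did not exercise.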

Can we shift some of the `@Nullable` annotations and associated assertions around in `ShardSnapshotStatus`? With this change I think we might still receive a null generation over the wire from an older version, but we shouldn't be creating them afresh any more?

In fact we ought to change the wire format so it's no longer an `OptionalString`. I'm ok with not doing that here, it'll make the backport that much harder; a follow-up is fine.

We would in fact still create them for pre-7.6 state machine operation (still a thing if there's an old snapshot in your repo), and we don't have access to the snapshot version in these constructors. In these, `null` means "figure out the numeric generation yourself", which would happen to a queued operation if, e.g., the first operation for a shard in the CS fails.
Let me see what I can do about this though :)
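
As a rough illustration of what "`null` means figure out the numeric generation yourself" could look like, here is a minimal sketch. It is not the repository's actual code: `ShardGenerationResolver`, `resolve`, and the assumption that numeric generations appear as blob names of the form `index-N` are all hypothetical choices made for this example.

```java
import java.util.List;

// Hypothetical helper; the names and the "index-N" blob naming are assumptions.
final class ShardGenerationResolver {
    private static final String PREFIX = "index-";

    private ShardGenerationResolver() {}

    /**
     * Returns the given generation if it is known. If it is null, falls back to
     * the highest numeric suffix among blobs named "index-N", or "-1" if no such
     * blob exists.
     */
    static String resolve(String generation, List<String> blobNames) {
        if (generation != null) {
            return generation;
        }
        long highest = -1;
        for (String name : blobNames) {
            if (name.startsWith(PREFIX)) {
                try {
                    highest = Math.max(highest, Long.parseLong(name.substring(PREFIX.length())));
                } catch (NumberFormatException e) {
                    // Suffix is not numeric; ignore this blob.
                }
            }
        }
        return Long.toString(highest);
    }
}
```

With this sketch, a known generation is passed through untouched, and only a `null` generation triggers the scan over the blob listing.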

Bad news here I'm afraid. I had to remove the assertion since it didn't hold up, but it's also really hard to assert this elsewhere due to the BwC situation (we can't neatly do this in `ShardSnapshotStatus` without refactoring its construction, and doing it elsewhere is tricky as well since it's so many places right now).
If it's ok with you I think I'd rather look for a cleaner way of asserting this stuff once #75501 has landed (or actually as part of incorporating this into that change) and just assert that we're not doing any illegal changes to `SnapshotsInProgress` like this any longer, where a non-null generation becomes a `null` generation for a given shard (much easier if we don't have to hack around translating `ShardId` and `RepoShardId` all over the place)?
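
The invariant proposed above (a non-null generation must never become a `null` generation for a given shard) could be checked along the following lines. This is only a sketch under stated assumptions: `GenerationInvariant` and `findIllegalTransition` are hypothetical names, and plain string-keyed maps stand in for the real `SnapshotsInProgress` entries, which this example does not attempt to model.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical check; plain maps stand in for the real cluster-state entries.
final class GenerationInvariant {

    private GenerationInvariant() {}

    /**
     * Returns some shard key whose generation was non-null in {@code before}
     * but is null in {@code after}, or empty if there is no such shard.
     * Shards that no longer appear in {@code after} are ignored.
     */
    static Optional<String> findIllegalTransition(Map<String, String> before, Map<String, String> after) {
        for (Map.Entry<String, String> entry : before.entrySet()) {
            String shard = entry.getKey();
            boolean hadGeneration = entry.getValue() != null;
            if (hadGeneration && after.containsKey(shard) && after.get(shard) == null) {
                return Optional.of(shard);
            }
        }
        return Optional.empty();
    }
}
```

A caller could run this against the state before and after each update and trip an assertion whenever the result is present. Keying both maps by one shard identifier type is what keeps the check this short, which matches the remark that it gets much easier without translating between `ShardId` and `RepoShardId`.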
Sure, thanks for looking. Blasted BwC, always spoiling the fun.