-
Notifications
You must be signed in to change notification settings - Fork 14.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-17506 KRaftMigrationDriver initialization race #17147
KAFKA-17506 KRaftMigrationDriver initialization race #17147
Conversation
This was found by investigating flaky ZkMigrationFailoverTest tests. Here is an example of a flaky failure where the driver never finished initializing https://ge.apache.org/s/6ymwuwglpcrza/tests/task/:core:test/details/kafka.zk.ZkMigrationFailoverTest/testDriverSkipsEventsFromOlderEpoch()%5B8%5D?top-execution=1 Here is a 40x run of the test with the changes in this PR https://github.com/apache/kafka/actions/runs/10784300492#summary-29907682759 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. The change looks good to me. But I think we need to add test for them.
@@ -519,7 +522,7 @@ public void run() throws Exception { | |||
KRaftMigrationDriver.this.image = image; | |||
String metadataType = isSnapshot ? "snapshot" : "delta"; | |||
|
|||
if (migrationState.equals(MigrationDriverState.INACTIVE)) { | |||
if (EnumSet.of(MigrationDriverState.UNINITIALIZED, MigrationDriverState.INACTIVE).contains(migrationState)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, nice! Not only onControllerChange
, but all OnMetadataChange
will cause race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, this race is harmless since onMetadataChange
does not alter the state machine state and most of its logic is guarded behind if (!migrationState.allowDualWrite()) {
. But it's good to have this for completeness.
metadata/src/main/java/org/apache/kafka/metadata/migration/KRaftMigrationDriver.java
Outdated
Show resolved
Hide resolved
Thanks for taking a look @showuon! I've added a unit test for the fixed race condition. Without the fix, this test reliably fails for me locally. |
…rationDriver-init-race
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is the blocker for v3.9.0, I think we can merge it soon and address the minor comment afterward.
// enqueue a poll event. once run, this will enqueue a recovery event | ||
driver.start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we can invoke driver.start
after driver.onControllerChange
, so that we can reliably make sure the test goes with what we expected, that is: UNINITIALIZED state -> onControllerChange
-> RecoverMigrationStateFromZKEvent
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let's consider the possibilities:
Case 1: If the PollEvent runs quickly, we'll see
- start(), enqueue PollEvent
- run PollEvent, enqueue RecoverMigrationStateFromZKEvent (from PollEvent)
- onControllerChange(), enqueue RecoverMigrationStateFromZKEvent and KRaftLeaderEvent
- run RecoverMigrationStateFromZKEvent, state = INACTIVE
- run RecoverMigrationStateFromZKEvent, no-op
- run KRaftLeaderEvent, state is INACTIVE ✅
Case 2: If the RecoverMigrationStateFromZKEvent from PollEvent runs before onControllerChange:
- start(), enqueue PollEvent
- run PollEvent, enqueue RecoverMigrationStateFromZKEvent (from PollEvent)
- run RecoverMigrationStateFromZKEvent, state = INACTIVE
- onControllerChange(), enqueue KRaftLeaderEvent
- run KRaftLeaderEvent, state is INACTIVE ✅
Only Case 1 tests the race condition and the fix.
If we reverse the operations, we'll have
Case 3:
- onControllerChange(), enqueue RecoverMigrationStateFromZKEvent and KRaftLeaderEvent
- start(), enqueue PollEvent
- run RecoverMigrationStateFromZKEvent, state = INACTIVE
- run KRaftLeaderEvent, state is INACTIVE ✅
- run PollEvent, no-op
So yea, it makes sense to use Case 3 for the test since it always contrives the race condition.
There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If onControllerChange is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state. This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored. Reviewers: Luke Chen <[email protected]>
There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If onControllerChange is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state. This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored. Reviewers: Luke Chen <[email protected]>
There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If onControllerChange is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state. This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored. Reviewers: Luke Chen <[email protected]>
There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If onControllerChange is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state. This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored. Reviewers: Luke Chen <[email protected]>
There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If
onControllerChange
is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state.This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored.