
KAFKA-17506 KRaftMigrationDriver initialization race #17147

Merged: 10 commits merged into apache:trunk on Sep 11, 2024

Conversation

@mumrah (Member) commented on Sep 10, 2024

There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If onControllerChange is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state.

This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored.
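
To make the mechanics concrete, here is a minimal, self-contained sketch of the idea rather than the actual KRaftMigrationDriver code: the driver state enum, the single-threaded event queue, and the recovery/leader events are reduced to toy stand-ins, and every name except onControllerChange(), start(), and the notion of a RecoverMigrationStateFromZKEvent is invented for illustration.

import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the fix, not the real KRaftMigrationDriver. A FIFO queue of
// Runnables stands in for the driver's single-threaded event queue, and the
// nested enum stands in for MigrationDriverState.
public class MigrationDriverSketch {

    enum State { UNINITIALIZED, INACTIVE, WAIT_FOR_CONTROLLER_QUORUM }

    private final Queue<Runnable> eventQueue = new ArrayDeque<>();
    private State state = State.UNINITIALIZED;

    State state() {
        return state;
    }

    // Stand-in for RecoverMigrationStateFromZKEvent: idempotent, so a duplicate
    // enqueue (one from the poll, one from onControllerChange) is harmless.
    private void recoverMigrationStateFromZk() {
        if (state == State.UNINITIALIZED) {
            state = State.INACTIVE; // "recovered" the migration state from ZK
        }
    }

    // start() enqueues a poll event; once the poll runs it enqueues recovery.
    public void start() {
        eventQueue.add(() -> eventQueue.add(this::recoverMigrationStateFromZk));
    }

    // The fix: if Raft reports a leader change before the driver has
    // initialized, enqueue recovery ahead of the leader event so the leader
    // event never observes an uninitialized driver.
    public void onControllerChange() {
        if (state == State.UNINITIALIZED) {
            eventQueue.add(this::recoverMigrationStateFromZk);
        }
        eventQueue.add(() -> { // stand-in for KRaftLeaderEvent
            if (state == State.INACTIVE) {
                state = State.WAIT_FOR_CONTROLLER_QUORUM; // illustrative transition
            }
        });
    }

    // Run queued events in order, including events enqueued by other events.
    public void drain() {
        Runnable event;
        while ((event = eventQueue.poll()) != null) {
            event.run();
        }
    }

    public static void main(String[] args) {
        MigrationDriverSketch driver = new MigrationDriverSketch();
        driver.start();              // enqueues the poll event
        driver.onControllerChange(); // leader change arrives before recovery has run
        driver.drain();
        // Without the UNINITIALIZED check in onControllerChange(), the leader
        // event would run before any recovery event, its notification would be
        // lost, and the driver would end up stuck.
        System.out.println("final state = " + driver.state()); // WAIT_FOR_CONTROLLER_QUORUM with the check
    }
}

The property the sketch mirrors from the patch is that the recovery event is idempotent, so enqueuing it from both the poll path and onControllerChange() costs nothing beyond one ignored event.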

@mumrah added the kraft label on Sep 10, 2024
@mumrah (Member, Author) commented on Sep 10, 2024

This was found while investigating flaky ZkMigrationFailoverTest runs. Here is an example of a flaky failure where the driver never finished initializing: https://ge.apache.org/s/6ymwuwglpcrza/tests/task/:core:test/details/kafka.zk.ZkMigrationFailoverTest/testDriverSkipsEventsFromOlderEpoch()%5B8%5D?top-execution=1

Here is a 40x run of the test with the changes in this PR: https://github.com/apache/kafka/actions/runs/10784300492#summary-29907682759

@mumrah requested a review from cmccabe on September 10, 2024 at 01:39
@showuon (Contributor) left a comment

Thanks for the fix. The change looks good to me, but I think we need to add a test for it.

@@ -519,7 +522,7 @@ public void run() throws Exception {
     KRaftMigrationDriver.this.image = image;
     String metadataType = isSnapshot ? "snapshot" : "delta";

-    if (migrationState.equals(MigrationDriverState.INACTIVE)) {
+    if (EnumSet.of(MigrationDriverState.UNINITIALIZED, MigrationDriverState.INACTIVE).contains(migrationState)) {
A Contributor commented on this diff:

Ah, nice! Not only onControllerChange but also onMetadataChange can cause the race condition.

@mumrah (Member, Author) replied:

Yea, this race is harmless since onMetadataChange does not alter the state machine state and most of its logic is guarded behind if (!migrationState.allowDualWrite()). But it's good to have this change for completeness.
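
For readers who have not looked at that handler, the shape being discussed is roughly the following simplified sketch; the enum values other than UNINITIALIZED and INACTIVE, the allowDualWrite() mapping, and the branch bodies are invented here purely to illustrate why an early metadata event is harmless, and the real handler carries more states and logic.

import java.util.EnumSet;

// Simplified illustration of the guards discussed above; not the real
// KRaftMigrationDriver metadata-change handling.
class MetadataChangeSketch {

    enum DriverState {
        UNINITIALIZED, INACTIVE, DUAL_WRITE;

        boolean allowDualWrite() {
            return this == DUAL_WRITE; // illustrative mapping only
        }
    }

    private DriverState migrationState = DriverState.UNINITIALIZED;
    private Object image; // stand-in for the latest MetadataImage

    void onMetadataChange(Object newImage) {
        this.image = newImage; // always remember the newest image

        // In this sketch, both pre-initialization states take an early return,
        // so a metadata event that arrives before recovery only records the
        // image and never touches the state machine.
        if (EnumSet.of(DriverState.UNINITIALIZED, DriverState.INACTIVE).contains(migrationState)) {
            return;
        }
        if (!migrationState.allowDualWrite()) {
            return; // the guard quoted above: no dual write until the driver is ready
        }
        // ... dual-write the metadata change back to ZooKeeper (omitted) ...
    }
}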

@mumrah (Member, Author) commented on Sep 10, 2024

Thanks for taking a look @showuon! I've added a unit test for the fixed race condition. Without the fix, this test reliably fails for me locally.

@mumrah requested a review from showuon on September 10, 2024 at 13:25
@showuon (Contributor) left a comment

Given this is the blocker for v3.9.0, I think we can merge it soon and address the minor comment afterward.

Comment on lines 250 to 251
// enqueue a poll event. once run, this will enqueue a recovery event
driver.start();
A Contributor commented:

nit: I think we can invoke driver.start after driver.onControllerChange, so that we can reliably make sure the test follows the path we expect: UNINITIALIZED state -> onControllerChange -> RecoverMigrationStateFromZKEvent.

@mumrah (Member, Author) replied on Sep 11, 2024

Good point. Let's consider the possibilities:

Case 1: If the PollEvent runs quickly, we'll see

  • start(), enqueue PollEvent
  • run PollEvent, enqueue RecoverMigrationStateFromZKEvent (from PollEvent)
  • onControllerChange(), enqueue RecoverMigrationStateFromZKEvent and KRaftLeaderEvent
  • run RecoverMigrationStateFromZKEvent, state = INACTIVE
  • run RecoverMigrationStateFromZKEvent, no-op
  • run KRaftLeaderEvent, state is INACTIVE ✅

Case 2: If the RecoverMigrationStateFromZKEvent from PollEvent runs before onControllerChange:

  • start(), enqueue PollEvent
  • run PollEvent, enqueue RecoverMigrationStateFromZKEvent (from PollEvent)
  • run RecoverMigrationStateFromZKEvent, state = INACTIVE
  • onControllerChange(), enqueue KRaftLeaderEvent
  • run KRaftLeaderEvent, state is INACTIVE ✅

Only Case 1 tests the race condition and the fix.

If we reverse the operations, we'll have

Case 3:

  • onControllerChange(), enqueue RecoverMigrationStateFromZKEvent and KRaftLeaderEvent
  • start(), enqueue PollEvent
  • run RecoverMigrationStateFromZKEvent, state = INACTIVE
  • run KRaftLeaderEvent, state is INACTIVE ✅
  • run PollEvent, no-op

So yea, it makes sense to use Case 3 for the test since it always reproduces the race condition.
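
For concreteness, the Case 3 ordering can be exercised against the toy MigrationDriverSketch shown earlier on this page (assumed to be compiled alongside the class below); the real unit test constructs and drives KRaftMigrationDriver itself, so this is only a sketch of the ordering, not of the actual test.

// Case 3 ordering, reusing the MigrationDriverSketch toy defined earlier.
public class Case3OrderingSketch {
    public static void main(String[] args) {
        MigrationDriverSketch driver = new MigrationDriverSketch();

        // Leader change arrives while the driver is still UNINITIALIZED:
        // recovery is enqueued ahead of the leader event.
        driver.onControllerChange();

        // start() runs afterwards, so the poll's recovery event is the
        // redundant one that gets ignored.
        driver.start();

        driver.drain();
        if (driver.state() == MigrationDriverSketch.State.UNINITIALIZED) {
            throw new AssertionError("driver got stuck before initializing");
        }
        System.out.println("driver initialized; state = " + driver.state());
    }
}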

@mumrah merged commit 0e30209 into apache:trunk on Sep 11, 2024 (4 of 5 checks passed)
@mumrah deleted the KAFKA-17506-KRaftMigrationDriver-init-race branch on September 11, 2024 at 14:41
mumrah added a commit that referenced this pull request Sep 11, 2024
There is a race condition between KRaftMigrationDriver running its first poll() and being notified by Raft about a leader change. If onControllerChange is called before RecoverMigrationStateFromZKEvent is run, we will end up getting stuck in the INACTIVE state.

This patch fixes the race by enqueuing a RecoverMigrationStateFromZKEvent from onControllerChange if the driver has not yet initialized. If another RecoverMigrationStateFromZKEvent was already enqueued, the second one to run will just be ignored.

Reviewers: Luke Chen <[email protected]>
mumrah added two more commits that referenced this pull request on Sep 11, 2024, each with the same commit message as above.

tedyu pushed a commit to tedyu/kafka that referenced this pull request on Jan 6, 2025, also with the same commit message.