Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-17661: Fix flaky BufferPoolTest.testBlockTimeout #17319

Merged
merged 4 commits into from
Oct 11, 2024

Conversation

chenyulin0719
Copy link
Contributor

@chenyulin0719 chenyulin0719 commented Sep 30, 2024

It's regarding KAFKA-17661.

The test relies on 3 asynchronous threads being triggered in parallel with the test thread[1]. However, there is no guarantee of parallelism in test environment. The 25 ms delay(10 ms maxBlockTimeMs) is obviously unreliable in the test environment.

To address the racing between asynchronous threads, removing the third delayed deallocation to ensure pool.allocate(10, maxBlockTimeMs) always hits the timeout.

[1]

// The first two buffers will be de-allocated within maxBlockTimeMs since the most recent allocation
delayedDeallocate(pool, buffer1, maxBlockTimeMs / 2);
delayedDeallocate(pool, buffer2, maxBlockTimeMs);
// The third buffer will be de-allocated after maxBlockTimeMs since the most recent allocation
delayedDeallocate(pool, buffer3, maxBlockTimeMs / 2 * 5);

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@github-actions github-actions bot added tests Test fixes (including flaky tests) clients labels Sep 30, 2024
@chenyulin0719
Copy link
Contributor Author

chenyulin0719 commented Sep 30, 2024

Another possible solution is to remove the asynchronous delayedDeallocate() in testBlockTimeout.
But I prefer to keep the more complexity test since there are no similar tests existing in BufferPoolTest.

@chia7712
Copy link
Member

chia7712 commented Oct 7, 2024

Another possible solution is to remove the asynchronous delayedDeallocate() in testBlockTimeout.

In fact, this solution looks good to me. We can have a follow-up to add multi-threaded tests for deallocate and allocate, but I don't think this test case requires asynchronous operations.

@chenyulin0719
Copy link
Contributor Author

@chia7712 That make sense to me. I will rewrite this PR.

@chenyulin0719 chenyulin0719 marked this pull request as draft October 7, 2024 08:40
@chenyulin0719 chenyulin0719 force-pushed the KAFKA-17661-fix-testBlockTimeout branch from ffad249 to 4dc55ca Compare October 7, 2024 19:06
@github-actions github-actions bot added the small Small PRs label Oct 7, 2024
@chenyulin0719
Copy link
Contributor Author

chenyulin0719 commented Oct 7, 2024

Hi @chia7712,

After rethink about it, I believe we can resolve the racing issue while still keeping the multi-threading test logic by removing the third delayed deallocation.

The root casue of racing is all the three delayed deallocation threads occur before the test thread. In this case pool.allocate(10, maxBlockTimeMs in test thread will get memory immediately.
So we can remove the third deallocation to make sure the pool.allocate(10, maxBlockTimeMs) will never be satisfied.

The following test target can be kept in the test:

  1. pool.allocate(10, maxBlockTimeMs) waits for maxBlockTimeMs for more memory (Link)
  2. pool.allocate(10, maxBlockTimeMs) won't returned early when receiving a notification from other thread. (Link)

Below is the detailed explanation for the new test:

  • [test thread] allocate buffer 1 with 1 memory
  • [test thread] allocate buffer 2 with 1 memory
  • [test thread] allocate buffer 3 with 1 memory
  • [test thread] request buffer with 10 memory, await for more memroy
  • [delayed thread] At 5 ms, deallocate buffer 1, signal test thread
  • [test thread] Awaken at 5 ms, await again for more memory
  • [test thread] pool.allocate(10, maxBlockTimeMs timeout at 10 ms -> Can run assertions after this point.
  • [delayed thread] At 10 ms, deallocate buffer 2

No matter when the delayed thread occurs, the timeout in the test thread always hits.
(pool.allocate(10, maxBlockTimeMs) never get enough memory)

Just updated the PR to this version. The test passes 500 loops in my local env. And also passes after adding delay to test thread.

@chenyulin0719 chenyulin0719 marked this pull request as ready for review October 7, 2024 19:56
@chia7712
Copy link
Member

chia7712 commented Oct 8, 2024

(pool.allocate(10, maxBlockTimeMs) never get enough memory)

What if the previous two threads haven’t allocated before this allocation? Would it still encounter a buffer exception?

@chenyulin0719
Copy link
Contributor Author

chenyulin0719 commented Oct 8, 2024

@chia7712
The asynchronous thread is for pool.deallocate(), not for allocation.

What if the previous two threads haven’t allocated before this allocation?

So it won't happen, the first three pool.allocate(1, maxBlockTimeMs) calls are run in the test thread, they are synchronous.

@chia7712
Copy link
Member

chia7712 commented Oct 8, 2024

The asynchronous thread is for pool.deallocate(), not for allocation.

Sorry for the typo. My point is that deallocate is a no-op in this test, as the case will pass even if those threads haven't deallocated. If you prefer the solution of removing the threads, you should remove all deallocation threads

@chenyulin0719
Copy link
Contributor Author

My point is that deallocate is a no-op in this test, as the case will pass even if those threads haven't deallocated.

You're right, removing the async deallocation doesn't change the test result.
The reason I decided to keep the async deallocation is that the deallocate threads are intended to validate this correct behavior under multi-threading scenario:

I won't say the deallcoate is a no-op in this test, but the orignal test name is not accurate.
I can split the test into 2 tests as you suggested previously. Split the testBlockTimeout() into below tests:

  1. testBlockTimeout()
  2. testAsyncDeallocateNotBreakBlockTimeout()

@chia7712 WDYT?

@chia7712
Copy link
Member

chia7712 commented Oct 8, 2024

When a block allocate() is signaled by other thread's deallocate() call (BufferPool.java#L273-L275), it still re-enter the the while loop until hits the block timeout. (BufferPool.java#L148-L153)
I won't say the deallcoate is a no-op in this test, but the orignal test name is not accurate.

I don't think this test case has actually tested this behavior, which is why I believe the deallocate is a no-op, as the related behavior isn't asserted in the test.

I can split the test into 2 tests as you suggested previously. Split the testBlockTimeout() into below tests:

Perhaps this PR can simplify the test case to address the flakiness, and then you can file a follow-up to add more test cases.

@chenyulin0719
Copy link
Contributor Author

OK, I just update this PR to simplify the test case.
We can create follow-up to discuss whether we should bring the multi-threading test back.

@mumrah
Copy link
Member

mumrah commented Oct 8, 2024

@chenyulin0719 fyi, there was a build issue on trunk. I have fixed the issue and updated your branch with the fix.

@chenyulin0719
Copy link
Contributor Author

@mumrah Thank you for letting me know. :)

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chenyulin0719 Please file a jira as follow-up to add more test cases

@chia7712 chia7712 merged commit e864d8f into apache:trunk Oct 11, 2024
6 checks passed
LoganZhuZzz pushed a commit to LoganZhuZzz/kafka that referenced this pull request Oct 12, 2024
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clients producer small Small PRs tests Test fixes (including flaky tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants