
KAFKA-17439: Make polling for new records an explicit action/event in the new consumer #17035

Merged: 27 commits merged into apache:trunk on Oct 28, 2024

Conversation

@kirktrue (Collaborator) commented Aug 28, 2024

Updated the FetchRequestManager to only create and enqueue fetch requests when signaled to do so by a FetchEvent.

The application thread and the background thread each contain logic that is performed when there is buffered data from a previous fetch. There's a race condition because the presence of buffered data could change between the two threads' respective checks. Right now the window for the race condition to occur is wide open; this change aims to make the window ajar.

In the ClassicKafkaConsumer, the application thread explicitly issues fetch requests (via the Fetcher class) at specific points in the Consumer.poll() cycle. Prior to this change, the AsyncKafkaConsumer would issue fetch requests independently of the user calling Consumer.poll(); the fetches would happen nearly continuously as soon as any assigned partition was fetchable. With this change, the AsyncKafkaConsumer introduces a FetchEvent that signals to the background thread that a fetch request should be issued. The specific points where this is done in the Consumer.poll() cycle of the AsyncKafkaConsumer now match the ClassicKafkaConsumer. In short: this makes the AsyncKafkaConsumer behave nearly identically to the ClassicKafkaConsumer in this regard.
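A minimal sketch of that signaling idea (the names FetchSignal, eventQueue, and backgroundPollOnce below are illustrative stand-ins, not the actual Kafka classes; the real pieces are FetchEvent, the application event queue, and FetchRequestManager):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative stand-ins only; not the actual consumer internals.
final class FetchSignal {
    final CompletableFuture<Void> requestsCreated = new CompletableFuture<>();
}

final class FetchSignalingSketch {
    private final BlockingQueue<FetchSignal> eventQueue = new LinkedBlockingQueue<>();

    // Application thread: called at specific points in poll(); enqueues a signal
    // instead of letting the background thread fetch continuously.
    CompletableFuture<Void> sendFetches() {
        FetchSignal signal = new FetchSignal();
        eventQueue.add(signal);
        return signal.requestsCreated;
    }

    // Background thread: only creates fetch requests when a signal has arrived.
    void backgroundPollOnce() {
        FetchSignal signal = eventQueue.poll();
        if (signal == null)
            return;                      // no signal, so no new fetch requests
        createAndEnqueueFetchRequests(); // real code: FetchRequestManager.poll()
        signal.requestsCreated.complete(null);
    }

    private void createAndEnqueueFetchRequests() { /* build FetchRequests here */ }
}
```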

As mentioned above, this change does not completely solve the problem related to fetch session eviction. Exactly how to shut the race condition window completely is outside the scope of this change.

See KAFKA-17182.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kirktrue added the consumer and KIP-848 (The Next Generation of the Consumer Rebalance Protocol) labels on Aug 29, 2024
@AndrewJSchofield (Member) left a comment:

Thanks for the PR. I'm quite surprised to see how little code needed to change to make the fetching explicit. A couple of points.

First, I think it better not to overload the PollEvent because that's already used in the share consumer.

Second, it seems to me that there is still the potential for over-fetching, and this will still cause churn of the fetch session cache.

In the case where the consumer is only fetching a single partition, I think it works pretty well. The set of fetchable partitions will be empty if there's buffered data, and contain the only partition in the fetch session if there is not. So, you'll only send a Fetch request when there's a need for more data and the fetch session will not churn.

In the case where the consumer is fetching more than one partition on a particular node, if a subset of the partitions is fetchable, then the fetch session will be modified by sending a Fetch request and that seems to have the potential for a lot of churn.

Of course, all of this code is in common between the legacy consumer and the async consumer. The async consumer is still very keen on fetching so I don't properly grasp why this PR would make the fetch session behaviour better.

@kirktrue (Collaborator, Author) commented Sep 3, 2024

Hi @AndrewJSchofield!

Thanks for the review 👍

> First, I think it better not to overload the PollEvent because that's already used in the share consumer.

Agreed. I've introduced a FetchEvent so that the two separate mechanisms won't step on each other's toes.

> Second, it seems to me that there is still the potential for over-fetching, and this will still cause churn of the fetch session cache.

Agreed. This change aims to lessen the churn; preventing it completely is a future task.

> In the case where the consumer is only fetching a single partition, I think it works pretty well. The set of fetchable partitions will be empty if there's buffered data, and contain the only partition in the fetch session if there is not. So, you'll only send a Fetch request when there's a need for more data and the fetch session will not churn.

Correct.

> In the case where the consumer is fetching more than one partition on a particular node, if a subset of the partitions is fetchable, then the fetch session will be modified by sending a Fetch request and that seems to have the potential for a lot of churn.

Correct again!

Any partition with buffered data at the point where the fetch request is being generated will be marked as "removed" from the broker's fetch session cache. That's the crux of the problem 😞

Something that I tend to lose sight of is that it's not a foregone conclusion that a fetch session will be evicted when it has partitions removed. Of course, removing partitions does increase the session's eligibility for eviction if the broker hosting it is resource-constrained and invokes the eviction process.
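A rough illustration of that bookkeeping (hypothetical types; the real logic lives in the client's fetch session handling and the broker's fetch session cache): partitions that still hold buffered data are left out of the next fetch, and their absence is what reads as a removal from the incremental session.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the actual FetchSessionHandler logic.
final class SessionDiffSketch {
    static Set<String> partitionsRemovedFromSession(Set<String> sessionPartitions,
                                                    Set<String> partitionsWithBufferedData) {
        // Fetchable = partitions in the session that have no buffered data yet.
        Set<String> nextFetch = new HashSet<>(sessionPartitions);
        nextFetch.removeAll(partitionsWithBufferedData);

        // Anything in the session but absent from the next fetch looks "removed".
        Set<String> removed = new HashSet<>(sessionPartitions);
        removed.removeAll(nextFetch);
        return removed;
    }

    public static void main(String[] args) {
        Set<String> session = Set.of("topic-0", "topic-1", "topic-2");
        Set<String> buffered = Set.of("topic-1");
        // Prints [topic-1]: the partition with buffered data churns the session.
        System.out.println(partitionsRemovedFromSession(session, buffered));
    }
}
```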

> Of course, all of this code is in common between the legacy consumer and the async consumer.

I'm not sure I follow. This code is all specific to the AsyncKafkaConsumer. While the ClassicKafkaConsumer has a similar race condition, it is 2-4 orders of magnitude less likely to happen.

> The async consumer is still very keen on fetching so I don't properly grasp why this PR would make the fetch session behaviour better.

Yep—the design of the AsyncKafkaConsumer fetching continuously in the background makes it very keen to cause this problem. With this change, the application thread now signals when to fetch, which results in the background thread creating and issuing the fetch requests much less often.

Thanks!

@kirktrue added the ctr (Consumer Threading Refactor, KIP-848) label on Sep 3, 2024
@kirktrue (Collaborator, Author) commented Sep 3, 2024

@AndrewJSchofield, et al.—it can be helpful to compare the flow of ClassicKafkaConsumer.poll() and AsyncKafkaConsumer.poll(), specifically how each invokes fetching. Note that the sendFetches() method name, as well as when it is invoked, comes from ClassicKafkaConsumer.poll(). So this really makes the new consumer act much more like the old one.
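For orientation, here is a heavily simplified sketch of that shared poll() shape (the method names mirror the real ones, but the bodies are placeholders, not the actual source):

```java
import java.time.Duration;
import java.util.List;

// Hypothetical, heavily simplified skeleton of the poll() cycle.
final class PollFlowSketch {
    List<String> poll(Duration timeout) {
        updateAssignmentMetadataIfNeeded();
        List<String> fetch = pollForFetches(timeout);
        if (!fetch.isEmpty()) {
            sendFetches();   // pipeline the next fetch before handing records back
            return fetch;
        }
        return List.of();
    }

    private List<String> pollForFetches(Duration timeout) {
        List<String> buffered = collectFetch();
        if (!buffered.isEmpty())
            return buffered;          // serve buffered data without a new request
        sendFetches();                // explicit request creation (the FetchEvent path)
        waitForResponses(timeout);    // classic: client.poll(); async: background I/O
        return collectFetch();
    }

    private void updateAssignmentMetadataIfNeeded() { }
    private void sendFetches() { }
    private void waitForResponses(Duration timeout) { }
    private List<String> collectFetch() { return List.of(); }
}
```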

@kirktrue force-pushed the KAFKA-17439-poll-explicitly branch 2 times, most recently from 9527e8f to a8567c9 on September 5, 2024 19:09

Commit "… the new consumer": Updated the FetchRequestManager to only create and enqueue fetch requests when signaled to do so by a FetchEvent.

@kirktrue force-pushed the KAFKA-17439-poll-explicitly branch from bb7efc1 to e984638 on September 5, 2024 19:10
@AndrewJSchofield (Member) left a comment:

Thanks for the updates and for the explanation of the mechanism.

I think it would be appropriate to test the mechanism of the pending fetch request future, in its various permutations, in FetchRequestManagerTest.

@lianetm (Member) left a comment:

Hey @kirktrue, thanks for the updates, some comments...

@lianetm (Member) left a comment:

Thanks for the updates!

@kirktrue (Collaborator, Author) commented:

@lianetm—tests are passing and all comments have been addressed. Can you make another review pass? Thanks!

@lianetm (Member) left a comment:

Thanks for the updates @kirktrue! Took another pass and left some comments for consideration.

prepareFetchRequests(),
this::handleFetchSuccess,
this::handleFetchFailure
);
);
pendingFetchRequestFuture.complete(null);
Reviewer comment (Member):

do we need to complete this future also on pollOnClose? there may be a pendingFetchRequestFuture there that won't be completed (not that I'm seeing how leaving that future uncompleted on close will cause a problem but seems safer to complete it, consistently with how we do it here after pollInternal)

@kirktrue (Collaborator, Author) replied:

I've moved the Future-handling code to pollInternal() for consistency. LMK what you think.
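A minimal sketch of that consistency point, with hypothetical names mirroring poll(), pollOnClose(), and pollInternal(): both poll paths funnel through pollInternal(), so a pending future is never left uncompleted on close.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; not the actual FetchRequestManager code.
final class PollInternalSketch {
    private CompletableFuture<Void> pendingFetchRequestFuture;

    void poll()        { pollInternal(false); }
    void pollOnClose() { pollInternal(true);  }

    private void pollInternal(boolean closing) {
        // ... prepare and enqueue fetch requests (or the final fetches when closing) ...
        CompletableFuture<Void> pending = pendingFetchRequestFuture;
        if (pending != null) {
            pending.complete(null);   // never leave a waiting caller blocked
            pendingFetchRequestFuture = null;
        }
    }
}
```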

@@ -1520,6 +1523,9 @@ private Fetch<K, V> pollForFetches(Timer timer) {
return fetch;
}

// send any new fetches (won't resend pending fetches)
sendFetches(timer);
Reviewer comment (Member):

The actual poll now happens in here (addAndGet that will complete when the background has had one run, called fetchMgr.poll), so should the log line on ln 1538 "Polling for fetches with timeout..." be right before this?

@kirktrue (Collaborator, Author) replied:

We're not polling for the incoming responses in sendFetches(), just enqueuing the outgoing requests. This mimics the ClassicKafkaConsumer in that the requests are enqueued in its sendFetches(), but then, toward the bottom of pollForFetches(), client.poll() is invoked to wait for the results of the fetch requests.

Reviewer comment (Member):

Well, sendFetches blocks until the CreateFetchRequestsEvent completes, and that only happens on fetchMgr.poll.

So when sendFetches completes we did poll the manager, right? (And depending on timing, maybe the client.poll happened too, since it runs in the background right after polling all the managers.) That's why the log for "Polling for fetches" made sense to me before the sendFetches, but am I missing another poll happening after the log line maybe? (where it is now)

@kirktrue (Collaborator, Author) replied:

The two ConsumerDelegate implementations work differently:

  • AsyncKafkaConsumer: FetchRequestManager.poll() will complete the event's Future on the background thread before it exits, i.e. before the thread starts the network I/O. Completing the Future starts the application thread racing toward logging that message and the background thread racing toward starting network I/O. I'll admit—I haven't dug through the code to surmise the relative costs of each thread's work before either crosses its finish line.
  • ClassicKafkaConsumer: Fetcher.sendFetchesInternal() calls ConsumerNetworkClient.send() to enqueue the request, but then it calls NetworkClient.wakeup(). Since the same ConsumerNetworkClient instance used by the consumer is also used by AbstractCoordinator.HeartbeatThread, it's technically possible that the heartbeat thread's run() method could start network I/O when it calls NetworkClient.pollNoWakeup(). Granted, that's a race that the application thread is much more likely to win given that the heartbeat thread runs much less frequently.

Here are some points to consider:

  • The definition of the term "poll" as used in the log is open to interpretation. The term "poll" is everywhere, making its meaning ambiguous at any given point of use 😢
  • I agree there is a race condition (for both consumers, but more likely for the new consumer) that could result in the log message being emitted after the network I/O has commenced
  • For this to pose a problem to users, there need to be other log entries that we're racing with, right? We're trying to avoid the condition where the user is confused/misled because the entries in the log are emitted in a non-deterministic order.
  • The log line in question is only output at level TRACE, which I assume is very rare for users to enable.

Given the above, I'm of the opinion that it's an exercise in hair splitting to alter the logging. However, I could also just change it which would have been way less effort than researching, thinking, and composing this response 🤣

If we leave the log line as it is, what would the effect be for the user?

Reviewer comment (Member):

I surely didn't intend for you to put up that long response he he, sorry. It's not about the log line per se, it's about the alignment on where the poll happens. The classic consumer logs "Polling for records", then calls client.poll; here we do sendFetches (which triggers the client.poll asynchronously in the background thread, because it blocks until we poll the fetch manager), then log "Polling for fetches...".

That's the diff I saw and just wanted to understand/align on where the poll happens: once we trigger sendFetches (blocking), the client.poll will happen in the background anytime, not controlled by the app thread. Agreed? If so I'm ok with leaving the log unchanged, understanding it could come out after the client.poll happened.

@kirktrue (Collaborator, Author) replied:

> That's the diff I saw and just wanted to understand/align on where the poll happens: once we trigger sendFetches (blocking), the client.poll will happen in the background anytime, not controlled by the app thread. Agreed?

Agreed—the background thread is going to move from calling each of the RequestManagers' poll() methods to the NetworkClient.poll() method without the intervention of the application thread.
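A simplified sketch of that ordering in the background thread's run loop (hypothetical interfaces; not the literal background-thread code):

```java
import java.util.List;

// Hypothetical run-loop sketch: the background thread polls each request manager
// (which may create and enqueue fetch requests) and then moves straight into
// network I/O, without waiting for the application thread.
final class BackgroundLoopSketch {
    interface RequestManager { void poll(long currentTimeMs); }
    interface NetworkClientDelegate { void poll(long timeoutMs, long currentTimeMs); }

    private final List<RequestManager> requestManagers;
    private final NetworkClientDelegate networkClient;

    BackgroundLoopSketch(List<RequestManager> requestManagers, NetworkClientDelegate networkClient) {
        this.requestManagers = requestManagers;
        this.networkClient = networkClient;
    }

    void runOnce() {
        long now = System.currentTimeMillis();
        for (RequestManager manager : requestManagers)
            manager.poll(now);        // e.g. fetch, heartbeat, offset commit managers
        networkClient.poll(100, now); // network I/O happens immediately afterwards
    }
}
```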

> If so I'm ok with leaving the log unchanged, understanding it could come out after the client.poll happened.

Thanks!

@@ -707,6 +708,8 @@ public ConsumerRecords<K, V> poll(final Duration timeout) {
updateAssignmentMetadataIfNeeded(timer);
final Fetch<K, V> fetch = pollForFetches(timer);
if (!fetch.isEmpty()) {
sendFetches(timer);
Reviewer comment (Member):

at this point we may already have records in hand to return (consumed position updated), so we should be very careful not to throw any error here. But this sendFetches could throw an interrupted exception because of the addAndGet, right?

Shouldn't we just do a best effort to pipeline the next requests using add instead of addAndGet? It would achieve what we want, removing the risk of errors, and it would actually align better with what the classic does on this sendFetches + transmitSends:

* Poll for network IO in best-effort only trying to transmit the ready-to-send request
* Do not check any pending requests or metadata errors so that no exception should ever
* be thrown, also no wakeups be triggered and no interrupted exception either.
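A hedged sketch of the difference being suggested (hypothetical event queue and method names): addAndGet blocks on the event's future and can therefore surface an InterruptedException, while add is fire-and-forget, best-effort pipelining.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the two submission styles discussed above.
final class EventSubmissionSketch {
    static final class CreateFetchRequestsEvent {
        final CompletableFuture<Void> future = new CompletableFuture<>();
    }

    private final BlockingQueue<CreateFetchRequestsEvent> queue = new LinkedBlockingQueue<>();

    // Blocking style ("addAndGet"): waits for the background thread, so it can
    // throw if interrupted, which is risky once records are already in hand.
    void addAndGet(CreateFetchRequestsEvent event) throws Exception {
        queue.add(event);
        event.future.get();   // may throw InterruptedException / ExecutionException
    }

    // Fire-and-forget style ("add"): best-effort pipelining of the next fetch;
    // no exception propagates back into poll() after records were returned.
    void add(CreateFetchRequestsEvent event) {
        queue.add(event);
    }
}
```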

@kirktrue (Collaborator, Author) replied:

Done.

@lianetm (Member) left a comment:

Thanks for the updates @kirktrue! Just one nit left, almost there.

*
* <ul>
* <li>
* The method will wait for confirmation of the request creation before continuing.
Reviewer comment (Member):

This is not true now for prefetching, which uses .add instead of .addAndGet; should we remove this line?

@kirktrue (Collaborator, Author) replied:

Good catch. Reworded to state that it will not wait for confirmation.

@lianetm (Member) left a comment:

Thanks for all the updates @kirktrue! LGTM.

@lianetm merged commit 9e42475 into apache:trunk on Oct 28, 2024
6 checks passed
@kirktrue deleted the KAFKA-17439-poll-explicitly branch on October 28, 2024 23:50
abhishekgiri23 pushed a commit to abhishekgiri23/kafka that referenced this pull request Nov 2, 2024
chiacyu pushed a commit to chiacyu/kafka that referenced this pull request Nov 30, 2024
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
Labels: ci-approved, clients, consumer, ctr Consumer Threading Refactor (KIP-848), KIP-848 The Next Generation of the Consumer Rebalance Protocol