KAFKA-15974: Enforce that event processing respects user-provided timeout #15640
Conversation
Yes, the network layer changes are captured in KAFKA-16200 and build on top of this PR.
@@ -281,64 +276,15 @@ void testEnsureMetadataUpdateOnPoll() {
    }

    @Test
    void testEnsureEventsAreCompleted() {
Why did you remove this test without replacement?
Reinstated.
Actually, seems to me that we shouldn't have this test here (and maybe this is why @kirktrue removed it before?). As I see it, this unit test is testing something that is not the ConsumerNetworkThread's responsibility (and that's why it ends up being complicated, having to mimic the reaper behaviour and spying). It is testing that events are completed, and that's the reaper.reap responsibility, so seems to me we need to:
- test that the ConsumerNetworkThread calls the reaper with the full list of events -> done already in the testCleanupInvokesReaper
- test that CompletableEventReaper.reap(Collection<?> events) completes the events -> done in CompletableEventReaperTest (testIncompleteQueue and testIncompleteTracked)
In the end, as it is, we end up asserting a behaviour we're mocking ourselves in the doAnswer, so not much value I would say? Agree with @cadonna that we need coverage, but I would say that we have it, in my points 1 and 2, and this should be removed. Makes sense?
Yes, the test was a little suspect in terms of its value-add, so I'd removed it.
I was planning to file a Jira to move several of the tests (including this one) from ConsumerNetworkThreadTest to ApplicationEventProcessorTest. Then we could fix up some of the funkiness in this test as a separate task.
That is all fine! I was not arguing that we need to keep the test, but when I see a test removed without replacement, I suspect a mistake, which apparently did not happen in this case. Next time, please comment on the PR about why you removed the test.
consumer = newConsumer();
completeUnsubscribeApplicationEventSuccessfully();
consumer.unsubscribe();
verify(backgroundEventReaper).reap(any(Long.class));
You control the time here. Why do you not verify that reap() is called with the correct time?
Good call. Done!
@Test
void testRunOnceInvokesReaper() {
    consumerNetworkThread.runOnce();
    verify(applicationEventReaper).reap(any(Long.class));
You control the time here. Why do you not verify that reap() is called with the correct time?
And done here, too.
Do you still have the change locally? Because here it still does not verify the correct time.
@lianetm Thanks for the explanation!
// Close the consumer here as we know it will cause a FencedInstanceIdException to be thrown.
// If we get an error other than the FencedInstanceIdException, we'll raise a ruckus.
try {
    consumer.close();
} catch (KafkaException e) {
    assertNotNull(e.getCause());
    assertInstanceOf(FencedInstanceIdException.class, e.getCause());
} finally {
    consumer = null;
}
Do we expect the close to throw? If so, we should verify that (at the moment our test will just complete successfully if the close does not throw). If that's the expectation, maybe this simpler snippet would cover it all:
- // Close the consumer here as we know it will cause a FencedInstanceIdException to be thrown.
- // If we get an error other than the FencedInstanceIdException, we'll raise a ruckus.
- try {
-     consumer.close();
- } catch (KafkaException e) {
-     assertNotNull(e.getCause());
-     assertInstanceOf(FencedInstanceIdException.class, e.getCause());
- } finally {
-     consumer = null;
- }
+ Throwable e = assertThrows(KafkaException.class, () -> consumer.close());
+ assertInstanceOf(FencedInstanceIdException.class, e.getCause());
+ consumer = null;
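The reviewer's point can be seen in plain Java: a bare try/catch only asserts when an exception is actually thrown, so the test silently passes if close() never throws, whereas an assertThrows-style check fails in that case. Below is a minimal, hypothetical re-implementation of the idea (illustrative names; this is not JUnit's code):

```java
// Minimal sketch of an assertThrows-style helper. AssertThrowsSketch and
// ThrowingRunnable are illustrative names, not JUnit API.
public class AssertThrowsSketch {
    interface ThrowingRunnable {
        void run() throws Throwable;
    }

    static <T extends Throwable> T assertThrows(Class<T> expected, ThrowingRunnable body) {
        try {
            body.run();
        } catch (Throwable t) {
            if (expected.isInstance(t))
                return expected.cast(t); // the expected exception: hand it back for further checks
            throw new AssertionError("unexpected exception type: " + t, t);
        }
        // Unlike a bare try/catch, reaching this point is a failure.
        throw new AssertionError("expected " + expected.getSimpleName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        RuntimeException e =
            assertThrows(RuntimeException.class, () -> { throw new RuntimeException("boom"); });
        System.out.println(e.getMessage()); // prints "boom"
    }
}
```

The returned exception can then be inspected further, e.g. asserting on its cause as in the suggested snippet.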
How did we resolve this? I see the section got completely removed; is verification not needed?
Yes, it turns out that changes made elsewhere have obviated the need for this check.
final Timer timer) {
    if (!shouldAutoCommit)
        return;
void maybeAutoCommitSync(final Timer timer) {
This is not a "maybe" anymore, so what about autoCommitSyncAllConsumed?
Changed to just autoCommitSync(). Is that OK?
// First, complete (exceptionally) any events that have passed their deadline AND aren't already complete.
tracked.stream()
    .filter(e -> !e.future().isDone())
    .filter(e -> currentTimeMs > e.deadlineMs())
Don't we want >= here when identifying expired events? I would expect so (that's the semantics applied in the Timer class's isExpired, for instance).
This is an interesting point 🤔
If a user provides a timeout of 1000 milliseconds, is it expired at 1000 milliseconds or at 1001 milliseconds?
Regardless, I will change it to >= to be consistent.
Done.
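For illustration, here is a self-contained sketch of the >= semantics agreed on above: an event whose deadline equals the current time is treated as expired, matching the behaviour of Timer.isExpired(). The class and method names are hypothetical, not the actual Kafka code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the reaper's deadline check; not the real Kafka class.
public class DeadlineCheckSketch {
    record TrackedEvent(CompletableFuture<Void> future, long deadlineMs) {}

    static void expire(List<TrackedEvent> tracked, long currentTimeMs) {
        tracked.stream()
                .filter(e -> !e.future().isDone())
                .filter(e -> currentTimeMs >= e.deadlineMs()) // >=: expired *at* the deadline
                .forEach(e -> e.future().completeExceptionally(
                        new TimeoutException("event expired at " + currentTimeMs + " ms")));
    }

    public static void main(String[] args) {
        TrackedEvent atDeadline = new TrackedEvent(new CompletableFuture<>(), 1000L);
        TrackedEvent notYetDue  = new TrackedEvent(new CompletableFuture<>(), 1001L);
        expire(List.of(atDeadline, notYetDue), 1000L);
        System.out.println(atDeadline.future().isCompletedExceptionally()); // true
        System.out.println(notYetDue.future().isCompletedExceptionally());  // false
    }
}
```

With > instead of >=, a deadline of 1000 ms would only expire at 1001 ms, one tick later than Timer.isExpired() would report.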
* could occur when processing the events. In such cases, the processor will take a reference to the first
* error, continue to process the remaining events, and then throw the first error that occurred.
*/
private boolean processBackgroundEvents(EventProcessor<BackgroundEvent> processor) {
This processor passed as an argument is in the end always a reference to the backgroundEventProcessor, so could we simplify this, remove the arg, and directly reference the var? It caught my attention when seeing how this is used, which seems a bit redundant, with all calls having to provide the same processBackgroundEvents(backgroundEventProcessor, ... which feels like an internal that processBackgroundEvents could know about.
There is a unit test that passes in a mocked event processor. Let me look at refactoring this.
Done. That's much better 😄
Co-authored-by: Lianet Magrans <[email protected]>
Thanks for your patience and great effort here @kirktrue, LGTM to merge and move on with the follow-ups. Just to recap, this is what I see should be addressed next related to timeout enforcement:
- https://issues.apache.org/jira/browse/KAFKA-16637
- https://issues.apache.org/jira/browse/KAFKA-16200
- https://issues.apache.org/jira/browse/KAFKA-16792
Also, please let's have a Jira to address this comment and remove the test we agreed brings no value.
Thanks again!
cc. @cadonna
I added KAFKA-16818 to cover the cases to refactor/migrate/remove tests.
…eout (apache#15640) The intention of the CompletableApplicationEvent is for a Consumer to enqueue the event and then block, waiting for it to complete. The application thread will block up to the amount of the timeout. This change introduces a consistent manner in which events are expired out by checking their timeout values. The CompletableEventReaper is a new class that tracks CompletableEvents that are enqueued. Both the application thread and the network I/O thread maintain their own reaper instances. The application thread will track any CompletableBackgroundEvents that it receives and the network I/O thread will do the same with any CompletableApplicationEvents it receives. The application and network I/O threads will check their tracked events, and if any are expired, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a TimeoutException. On closing the AsyncKafkaConsumer, both threads will invoke their respective reapers to cancel any unprocessed events in their queues. In this case, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a CancellationException instead of a TimeoutException to differentiate the two cases. The overall design for the expiration mechanism is captured on the Apache wiki and the original issue (KAFKA-15848) has more background on the cause. Note: this change only handles the event expiration and does not cover the network request expiration. That is handled in a follow-up Jira (KAFKA-16200) that builds atop this change. This change also includes some minor refactoring of the EventProcessor and its implementations. This allows the event processor logic to focus on processing individual events rather than also the handling of batches of events. Reviewers: Lianet Magrans <[email protected]>, Philip Nee <[email protected]>, Bruno Cadonna <[email protected]>
The intention of the CompletableApplicationEvent is for a Consumer to enqueue the event and then block, waiting for it to complete. The application thread will block up to the amount of the timeout. This change introduces a consistent manner in which events are expired by checking their timeout values.

The CompletableEventReaper is a new class that tracks CompletableEvents that are enqueued. Both the application thread and the network I/O thread maintain their own reaper instances. The application thread will track any CompletableBackgroundEvents that it receives, and the network I/O thread will do the same with any CompletableApplicationEvents it receives. The application and network I/O threads will check their tracked events, and if any are expired, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a TimeoutException.

On closing the AsyncKafkaConsumer, both threads will invoke their respective reapers to cancel any unprocessed events in their queues. In this case, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a CancellationException instead of a TimeoutException to differentiate the two cases.

The overall design for the expiration mechanism is captured on the Apache wiki, and the original issue (KAFKA-15848) has more background on the cause.

Note: this change only handles the event expiration and does not cover the network request expiration. That is handled in a follow-up Jira (KAFKA-16200) that builds atop this change.

This change also includes some minor refactoring of the EventProcessor and its implementations. This allows the event processor logic to focus on processing individual events rather than also the handling of batches of events.

Committer Checklist (excluded from commit message)
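To make the description above concrete, here is a condensed, hypothetical sketch of the reaper behaviour it describes: events whose deadline has been reached fail with a TimeoutException, while events still pending when the consumer closes are failed with a CancellationException so callers can tell the two cases apart. All names here are illustrative, not the actual Kafka classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Illustrative stand-in for the CompletableEventReaper pattern; not the real implementation.
public class EventReaperSketch {
    static final class Event {
        final CompletableFuture<Void> future = new CompletableFuture<>();
        final long deadlineMs;
        Event(long deadlineMs) { this.deadlineMs = deadlineMs; }
    }

    private final List<Event> tracked = new ArrayList<>();

    void add(Event event) { tracked.add(event); }

    // Called on each loop pass: fail events whose deadline has been reached.
    void reap(long currentTimeMs) {
        tracked.removeIf(e -> {
            if (e.future.isDone())
                return true; // already completed elsewhere; stop tracking it
            if (currentTimeMs >= e.deadlineMs) {
                e.future.completeExceptionally(new TimeoutException("event expired"));
                return true;
            }
            return false;
        });
    }

    // Called on close: cancel everything still pending, regardless of deadline.
    void reapAll() {
        tracked.forEach(e ->
            e.future.completeExceptionally(new CancellationException("consumer is closing")));
        tracked.clear();
    }
}
```

Each thread would hold its own instance of such a reaper, calling reap(currentTimeMs) on every loop iteration and reapAll() once during shutdown.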