KAFKA-8421: Still return data during rebalance #7312
Conversation
@ableegoldman @hachikuji for reviews.
retest this please
Also cc @ConcurrencyPractitioner who has looked into this.
…turn-data-during-rebalance
retest this please
Every build has 10+ failures, mostly the …
I ran them locally and they do pass (some are indeed flaky: with 10 runs they are going to fail at least once). I will have one more commit addressing some comments and will ping again when I'm done.
Discussed a little bit offline. We need to be a little careful with how an active rebalance affects other consumer operations; specifically, a call to commitSync may now interleave with an active rebalance. I think it might also be possible to get into a bad state if we are stuck between joining and syncing. The SyncGroup request will not be retried automatically; we rely on a call to poll to retry it.
Overall, LGTM. Just a single meta-comment.
// after the long poll, we should filter the returned data if the partitions
// they belong to are no longer owned by the consumer
final Set<TopicPartition> assignedPartitions = subscriptions.assignedPartitions();
return this.interceptors.onConsume(new ConsumerRecords<>(records.entrySet()
A meta-comment here.
I think filtering the records under most conditions is redundant, since the assignment remains static unless a rebalance occurs. We only need to filter records when the assignment has actually changed, e.g. after a rebalance has finished. When the assignment is unchanged this segment of the code has no effect, yet it could be a substantial performance hit, since this code path is hot.
We probably should have some sort of check that only filters the records when the assignment has changed.
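(A minimal sketch of what such a check could look like, under the assumption that a pre-poll snapshot of the assignment is available; all names here are hypothetical, not the PR's actual code:)

import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

// Only filter the fetched records when the assignment observed after the poll
// differs from the snapshot taken before it, so the common no-rebalance path
// pays no extra cost. Assumes the records map is mutable.
static <K, V> Map<TopicPartition, List<ConsumerRecord<K, V>>> filterIfAssignmentChanged(
        final Map<TopicPartition, List<ConsumerRecord<K, V>>> records,
        final Set<TopicPartition> assignmentBeforePoll,
        final Set<TopicPartition> assignmentAfterPoll) {
    if (!assignmentAfterPoll.equals(assignmentBeforePoll))
        records.keySet().retainAll(assignmentAfterPoll); // drop revoked partitions
    return records;
}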
That's a fair point. Originally I thought that since the returned records are in the form of Map<TopicPartition, List<ConsumerRecord<K, V>>> and we are filtering per topic-partition, not per-record, it may be okay; but if there's a better way to avoid the unnecessary stream() call we should do it.
One thing I can think of is to leverage Fetcher#clearBufferedDataForUnassignedPartitions and call it upon partition assignment changes. But since the background thread can also handle the fetch response, concurrent access on completedFetches is possible, and its iterator() is only weakly consistent: if the background thread is adding new batches to that list while the caller thread is iterating / removing from it, a batch may not get removed.
It seems that with the locking on ConsumerNetworkClient#poll the above should never happen (cc @hachikuji to confirm), in which case I think the above idea should work.
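(For reference, a minimal sketch of the weak-consistency concern, assuming completedFetches is a ConcurrentLinkedQueue as it is in Fetcher; the batch names are made up:)

import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

public class WeaklyConsistentIteratorSketch {
    public static void main(String[] args) throws InterruptedException {
        final ConcurrentLinkedQueue<String> completedFetches = new ConcurrentLinkedQueue<>();
        completedFetches.add("revoked-tp-batch-0");

        // background thread keeps appending new batches while we iterate
        Thread background = new Thread(() -> {
            for (int i = 1; i <= 1000; i++)
                completedFetches.add("revoked-tp-batch-" + i);
        });
        background.start();

        // The iterator is weakly consistent: it never throws
        // ConcurrentModificationException, but it is not guaranteed to reflect
        // elements added after its creation, so some batches may survive the sweep.
        Iterator<String> it = completedFetches.iterator();
        while (it.hasNext()) {
            it.next();
            it.remove();
        }
        background.join();
        System.out.println("batches left after the sweep: " + completedFetches.size());
    }
}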
Actually, I think this check is not needed, since in Fetcher#fetchRecords we already do this filtering:
if (!subscriptions.isAssigned(completedFetch.partition)) {
// this can happen when a rebalance happened before fetched records are returned to the consumer's poll call
log.debug("Not returning fetched records for partition {} since it is no longer assigned",
completedFetch.partition);
} else if (!subscriptions.isFetchable(completedFetch.partition)) {
// this can happen when a partition is paused before fetched records are returned to the consumer's
// poll call or if the offset is being reset
log.debug("Not returning fetched records for assigned partition {} since it is no longer fetchable",
completedFetch.partition);
}
And since the assignment is ONLY updated on the caller thread, and never on the background thread, we are safe to use this field to filter the fetched records, which are only returned to the caller thread, and hence ordering is guaranteed.
Ah, I see.
Edit: Actually, upon closer inspection, it appears I had the order mixed up. Looks like everything is accounted for; that was my mistake.
throw new IllegalArgumentException("Attempt to dynamically assign partitions while manual assignment in use");

Map<TopicPartition, TopicPartitionState> assignedPartitionStates = partitionToStateMap(assignments);
Nit: should this be declared final?
Ack.
@@ -222,21 +222,45 @@ public synchronized boolean assignFromUser(Set<TopicPartition> partitions) {
if (this.assignment.partitionSet().equals(partitions))
    return false;

Map<TopicPartition, TopicPartitionState> assignedPartitionStates = partitionToStateMap(partitions);
Nit: should declare final as well?
Ack.
…turn-data-during-rebalance
@hachikuji Here are my thoughts about the interleaving of rebalance and commit requests:
2.a) The group is still in … Note that if the join-group failed with a fatal error, the commit request would also fail with the same error (confirmed in the code); if the join-group failed with a retriable error, the commit request would also fail with that error. Among them:
2.b) The group is already in …
3.a) The group has already transited to … With the incremental protocol this is fine, since those owned partitions would not immediately be re-assigned to others in that rebalance; with the eager protocol we just need to make sure that nothing gets sent in offset commits, since nothing is owned. 3.b) The group has not transited, and is still in …
So in sum, the commitSync call should not be stuck while interleaving with the rebalance, but it may indeed fail with CommitFailed (cases 2.b and 3.b). In many callers, CommitFailed is treated as a fatal error: for example, in Streams it is translated into TaskMigrated and may cause unnecessary error handling. However, in those cases it is not necessarily a fatal error. I'd propose we do the following: a. Filter the passed-in offsets map by the assigned partitions (this is to help case 3.a). WDYT?
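(A minimal sketch of proposal (a), restricting the offsets map to partitions that are still assigned before committing; all names are hypothetical, not the PR's actual change:)

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Restrict the passed-in offsets map to the currently assigned partitions, so
// that nothing is committed for partitions the consumer no longer owns.
static Map<TopicPartition, OffsetAndMetadata> filterToAssigned(
        final Map<TopicPartition, OffsetAndMetadata> offsets,
        final Set<TopicPartition> assignedPartitions) {
    final Map<TopicPartition, OffsetAndMetadata> filtered = new HashMap<>(offsets);
    filtered.keySet().retainAll(assignedPartitions);
    return filtered;
}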
@ableegoldman @hachikuji @RichardYuSTUG I've updated the PR with the above analysis; please take another look.
…turn-data-during-rebalance
Failed tests pass locally, retest this please
retest this please
// since even if we are 1) in the middle of a rebalance or 2) have partitions
// with unknown starting positions we may still want to return some data
// as long as there are some partitions fetchable
updateAssignmentMetadataIfNeeded(time.timer(1L));
This is the tricky part: I have to use a non-zero timer so that a blocking RPC like find-coordinator is guaranteed to poll at least once; otherwise future.hasExpired would trigger immediately and we would be doomed to never finish that call.
I am wondering about two things here:
- How robust is this solution? Is it guaranteed that with the timer set to 1 the call is done?
- If I look at kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerNetworkClient.java, lines 210 to 215 in d112ffd:

public boolean poll(RequestFuture<?> future, Timer timer) {
    do {
        poll(timer, future);
    } while (!future.isDone() && timer.notExpired());
    return future.isDone();
}
If the timer is set to zero, we would not send the request at all -- note that the client.send call just queues up the request, and only client.poll writes it to the socket; that's why I cannot set the timer to 0. Setting it to 1 should be robust, since the timer is initialized at the beginning of the call and is not checked until the first client.poll(), so it would not expire early.
I've further polished it to only use timer(1) if the remaining time is still > 0; this way we still make sure that consumer.poll(0) can return instantaneously.
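(A sketch of that polish, assuming access to the consumer's Time and the caller's Timer; a hypothetical helper, not the actual diff:)

import org.apache.kafka.common.utils.Time;
import org.apache.kafka.common.utils.Timer;

// Use a 1 ms timer so that a queued request gets at least one network poll,
// but only when the caller still has time left; with a zero-remaining caller
// timer, poll(0) must stay instantaneous.
static Timer metadataUpdateTimer(final Time time, final Timer callerTimer) {
    return callerTimer.remainingMs() > 0 ? time.timer(1L) : time.timer(0L);
}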
if (generation == null) {
    log.info("Failing OffsetCommit request since the consumer is not part of an active group");
-   return RequestFuture.failure(new CommitFailedException());
+   return RequestFuture.failure(new RebalanceInProgressException("Offset commit cannot be completed since the " +
This is the main proposal for returning a different exception.
Is it always the case that if generation is null, it's due to a rebalance in progress? Or do we want to keep the CommitFailedException and add to the exception message that the failure could be due to a rebalance?
This is a good question; I would replace this case with the RetriableCommitFailed exception (see my other comment below).
*/
requestRejoin();
future.raise(new RebalanceInProgressException());
This is the main proposal for returning a different exception.
@guozhangwang Thanks for the PR.
I did a first pass.
consumer.poll(Duration.ZERO);

assertEquals(Utils.mkSet(topic, topic2), consumer.subscription());
assertEquals(Utils.mkSet(tp0, t2p0), consumer.assignment());
Do you really need these checks to verify your code?
I want to make sure my edits on updateAssignmentMetadataIfNeeded did not change any existing logic.
…turn-data-during-rebalance
// it's possible that the partition is no longer assigned when the response is received,
// so we need to ignore seeking if that's the case
if (this.subscriptions.isAssigned(tp))
@hachikuji This is the change I made for auto offset reset.
I thought about adding a unit test, but since the logic is wrapped in a single refreshCommittedOffsetsIfNeeded call, it is hard to change the subscription in between without breaking it into multiple calls, and I felt it was too messy to test a single monolithic function, so I did not add one.
Does it make sense to move this check up a little bit?
if (offsetAndMetadata != null && subscriptions.isAssigned(tp)) {
That will make the log message less confusing.
Hmm, but I thought even if the partition is no longer assigned, we may still want to update its epoch; on the other hand your concern is valid. Will tweak it a bit more.
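(A sketch of the tweak being discussed, mirroring the snippets elsewhere in this PR: always propagate the epoch, but only seek, and log, when the partition is still assigned. A fragment only; it relies on the surrounding refreshCommittedOffsetsIfNeeded context, and tp stands for entry.getKey():)

entry.getValue().leaderEpoch().ifPresent(epoch -> this.metadata.updateLastSeenEpochIfNewer(tp, epoch));

// it's possible that the partition is no longer assigned when the response
// is received, so only seek (and log) for partitions we still own
if (this.subscriptions.isAssigned(tp)) {
    this.subscriptions.seekUnvalidated(tp, position);
    log.info("Setting offset for partition {} to the committed offset {}", tp, position);
} else {
    log.info("Ignoring the fetched committed offset for partition {} since it is no longer assigned", tp);
}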
* This can occur if, e.g. consumer instance is in the middle of a rebalance so it is not yet determined
* which partitions would be assigned to the consumer yet. In such cases you can first complete the rebalance
* by calling {@link #poll(Duration)} and retry committing offsets again. NOTE when you retry after the
* * rebalance the assigned partitions may have changed, and also for those partitions that are still assigned
This javadoc needs to be fixed
Fixed.
@hachikuji it is ready for another look.
Thanks, left a few more comments.
} catch (final RetriableCommitFailedException error) {
    // commitSync throws this error and can be ignored (since EOS is not enabled, even if the task crashed
    // immediately after this commit, we would just reprocess those records again)
    log.info("Committing failed with a non-fatal error, we can ignore this since commit may succeed still");
Just in case, it's probably a good idea to include the exception in the message.
final RequestFuture<Void> future = lookupCoordinator();
client.poll(future, timer);

if (!future.isDone()) {
    // ran out of time
    future.addListener(new RequestFutureListener<Void>() {
Do we want this listener only when coordinator lookup is triggered through ensureCoordinatorReady? Other paths may use the future from a call to ensureCoordinatorReady. Conversely, we may use a future which was sent through another path here. Could we instead move this logic to lookupCoordinator so that it is handled consistently?
@hachikuji I actually did it intentionally: if we move the logic into the callee, then any client.poll may trigger the callback and potentially throw the exception, rather than just consumer.poll as we are testing in the unit test. Currently there are three callers of lookupCoordinator, and AFAIK the other two do not need to check whether the future failure is non-retriable and hence needs to be thrown. Of course, if in the future we add another caller which does want to check, then this would be vulnerable.
So I'd propose that if we make this behavior consistent inside lookupCoordinator, then we need to mark all public APIs that may lead to it with @throws for those exceptions (today consumer.poll and some other APIs have the marks, but not all of them). WDYT?
Regardless of where the lookupCoordinator is triggered, we are only raising it from ensureCoordinatorReady, so I am not sure I follow the point about raising from other contexts.
Note there doesn't appear to be any logic preventing multiple listeners from getting attached to the future. I think it would be better to always attach the listener when the future is created.
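(A sketch of that suggestion, with simplified signatures: attach the listener once, when the future is created inside lookupCoordinator, so every path is handled consistently. sendFindCoordinatorRequest stands in for the actual node lookup and send:)

protected synchronized RequestFuture<Void> lookupCoordinator() {
    if (findCoordinatorFuture == null) {
        findCoordinatorFuture = sendFindCoordinatorRequest();
        findCoordinatorFuture.addListener(new RequestFutureListener<Void>() {
            @Override
            public void onSuccess(Void value) {
                // nothing to stash on success
            }

            @Override
            public void onFailure(RuntimeException e) {
                // stash the failure so the caller thread can rethrow it on its next poll
                findCoordinatorException = e;
            }
        });
    }
    return findCoordinatorFuture;
}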
log.info("Setting offset for partition {} to the committed offset {}", tp, position);
}
nit: remove newlines
// need to reset generation and re-join group
resetGenerationOnResponseError(ApiKeys.OFFSET_COMMIT, error);
future.raise(new CommitFailedException());
return;
} else if (error == Errors.ILLEGAL_GENERATION) {
    if (this.generation.equals(ConsumerCoordinator.this.generation())) {
How much effort would it be to write a test case which hits this path?
I tried to add the test but found it may not actually be a valid case: we think this can happen when a join request is sent, then a commit request is sent, then a join response is received, and then a commit response is received, all on the same socket.
However, when a join request is sent we already transition to the REBALANCING state, and then in sendOffsetCommitRequest above (https://github.com/apache/kafka/pull/7312/files#diff-e9c1ee46a19a8684d9d8d8a8c77f9005R1067) we would immediately fail with a RetriableCommitFailure exception: if called from an async commit, it would be marked as a failure; if called from a sync commit, that exception would be thrown.
So it sounds to me that we do not need this specific handling since we should actually never hit this scenario?
consumer.assign(List(tp, tp2).asJava)
sendRecords(producer, numRecords, tp2)
var topic2RecordConsumed = false
I wonder if we really need to be testing with two separate topics here. We already have a hard time with the flakiness of this test.
…turn-data-during-rebalance
…turn-data-during-rebalance
@hachikuji I have made another commit, which moves back to …
Looked into …
Thanks, left a few more comments.
* @throws org.apache.kafka.clients.consumer.RetriableCommitFailedException if the commit failed but can be retried.
*         This can occur if, e.g. the consumer instance is in the middle of a rebalance so it is not yet determined
*         which partitions would be assigned to the consumer. In such cases you can first complete the rebalance
*         by calling {@link #poll(Duration)} and retry committing offsets again. NOTE when you retry after the
I think we should not say that the commit can be retried. I would just say that the rebalance needs to be completed by calling poll() and that the offsets to commit can be reconsidered after the group is rejoined.
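(A hedged sketch of what that guidance means for a caller; the helper name and the recompute step are illustrative, not part of this PR:)

import java.time.Duration;
import java.util.Map;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RebalanceInProgressException;

static void commitOrRejoin(final Consumer<String, String> consumer,
                           final Map<TopicPartition, OffsetAndMetadata> offsets) {
    try {
        consumer.commitSync(offsets);
    } catch (final RebalanceInProgressException e) {
        // complete the in-flight rebalance first; the assignment may change
        consumer.poll(Duration.ZERO);
        // the caller should now reconsider which offsets to commit for the
        // partitions it still owns, rather than retrying the same map blindly
    }
}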
log.debug("Committing offsets: {}", offsets); | ||
offsets.forEach(this::updateLastSeenEpochIfNewer); | ||
coordinator.commitOffsetsAsync(new HashMap<>(offsets), callback); | ||
} catch (CommitFailedException e) { | ||
log.error("Failed to commit offsets asynchronously because they do not belong to dynamically assigned partitions"); |
Do we still need this catch since we reverted the logic that verified only assigned partitions can be committed?
public RetriableCommitFailedException(Throwable t) {
    super("Offset commit failed with a retriable exception. You should retry committing " +
        "the latest consumed offsets.", t);
}

public RetriableCommitFailedException(String message) {
Hmm... are we safe to remove these, given that this is a public API? It's probably unlikely anyone is using them, but still...
Oh, I was not sure public classes counted as part of our API "contract" :P Anyhow, I don't feel strongly about it and I can revert.
@@ -130,11 +130,12 @@
    private MemberState state = MemberState.UNJOINED;
    private HeartbeatThread heartbeatThread = null;
    private RequestFuture<ByteBuffer> joinFuture = null;
    private RequestFuture<Void> findCoordinatorFuture = null;
    private RuntimeException findCoordinatorException = null;
We probably need to either make this volatile or an AtomicReference.
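(A minimal sketch of the hand-off being discussed, assuming the exception is set from the background/heartbeat thread and consumed on the caller thread; class and method names are hypothetical:)

import java.util.concurrent.atomic.AtomicReference;

public class CoordinatorExceptionHandoff {
    // written by the background thread, read and cleared by the caller thread;
    // AtomicReference (or a volatile field) guarantees visibility of the write,
    // while a plain field does not
    private final AtomicReference<RuntimeException> findCoordinatorException =
            new AtomicReference<>(null);

    // background thread: record a fatal lookup failure
    void onLookupFailure(final RuntimeException e) {
        findCoordinatorException.compareAndSet(null, e);
    }

    // caller thread: rethrow the stashed failure on the next poll, then clear it
    void maybeThrowLookupFailure() {
        final RuntimeException e = findCoordinatorException.getAndSet(null);
        if (e != null)
            throw e;
    }
}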
entry.getValue().leaderEpoch().ifPresent(epoch -> this.metadata.updateLastSeenEpochIfNewer(entry.getKey(), epoch));
this.subscriptions.seekUnvalidated(tp, position);

// it's possible that the partition is no longer assigned when the response is received,
Do you think it's worth adding a debug message saying that we're ignoring the fetched offset?
@@ -19,19 +19,7 @@
public class RebalanceInProgressException extends ApiException {
    private static final long serialVersionUID = 1L;

    public RebalanceInProgressException() {
Same here. I think we should probably leave these constructors around. They are not really doing any harm.
@@ -22,8 +22,22 @@
    private static final long serialVersionUID = 1L;

    public static RetriableCommitFailedException withUnderlyingMessage(String additionalMessage) {
I'm ok removing this API in spite of the compatibility concern. It's just that the other constructors are "standard" exception constructors and we have no real need to remove them.
@@ -274,8 +264,20 @@ public void onFailure(RuntimeException e) {
if (node == null) {
    log.debug("No broker available to send FindCoordinator request");
    return RequestFuture.noBrokersAvailable();
-} else
+} else{
nit: this probably breaks checkstyle
@@ -769,6 +769,9 @@ public boolean refreshCommittedOffsetsIfNeeded(Timer timer) {
this.subscriptions.seekUnvalidated(tp, position);

log.info("Setting offset for partition {} to the committed offset {}", tp, position);
} else {
    log.info("Ignoring the returned {} since its partition {} is no longer assigned",
nit: I'd suggest "Ignoring the fetched committed offset"
I originally did that, but then I realized OffsetAndMetadata#toString contains this wording already, so I decided to avoid duplication.
@guozhangwang LGTM, thanks for carrying this through. Note there is a failure in testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup which is probably worth checking out before we merge.
Re-ran the failed tests locally and they seem stable (30+ runs); will create the flaky-test JIRA and merge.
After #7312, we could still return data during the rebalance phase, which means it is possible to find records without corresponding tasks. We have to fall back to the unsubscribe mode when a task is migrated, as the assignment should be cleared out to stay in sync with the task manager state. Reviewers: A. Sophie Blee-Goldman <[email protected]>, Guozhang Wang <[email protected]>
- Do not wait until updateAssignmentMetadataIfNeeded returns true; only call it once with a 0 timeout. Also, do not return empty if in the middle of a rebalance.
- Trim the pre-fetched records after the long poll, since the assignment may have changed.
- Update SubscriptionState to retain the existing state in assignFromSubscribed if it already exists (similar to assignFromUser), so that we do not need the transition from INITIALIZING to FETCHING.
- Unit tests: this actually took me the most time :)