
KAFKA-4228 - make producer close on sender thread death, make consumer shutdown on failure to rebalance, and make MM die on any of the above. #1930

Closed

Conversation

radai-rosenblatt
Contributor

The JIRA issue (https://issues.apache.org/jira/browse/KAFKA-4228) details a cascade of failures that resulted in an entire mirror maker cluster stalling after an OOM death on one MM instance.

This patch makes producers and consumers close themselves on the errors they encounter, and makes MM shut down if anything happens to its producers or consumers.

Signed-off-by: radai-rosenblatt [email protected]
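
A minimal sketch of the failure wiring described above, with hypothetical names (FailFastSketch, newWorker); the real patch wires equivalent handling into MirrorMaker's own shutdown path rather than calling sys.exit directly:

import java.util.concurrent.atomic.AtomicBoolean

// Illustrative only: a shared flag plus an uncaught-exception handler that
// forces the whole mirror-maker process down when any worker thread dies.
object FailFastSketch {
  private val shuttingDown = new AtomicBoolean(false)

  val dieOnUncaught: Thread.UncaughtExceptionHandler =
    new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"Thread ${t.getName} died due to an uncaught exception: $e")
        // hypothetical: real code would close producers/consumers before exiting
        if (shuttingDown.compareAndSet(false, true))
          sys.exit(1)
      }
    }

  // Attach the handler to every worker thread so no death goes unnoticed.
  def newWorker(name: String)(body: => Unit): Thread = {
    val t = new Thread(new Runnable { override def run(): Unit = body }, name)
    t.setUncaughtExceptionHandler(dieOnUncaught)
    t
  }
}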

@@ -434,6 +434,10 @@ object MirrorMaker extends Logging with KafkaMetricsGroup {
            records.foreach(producer.send)
            maybeFlushAndCommitOffsets()
          }
          if (!mirrorMakerConsumer.hasData && !shuttingDown) {
Contributor

Maybe we need to add a check for !exitingOnSendFailure as well? Otherwise we may throw an IllegalStateException when there is a send failure.
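
Something along these lines, where exitingOnSendFailure, mirrorMakerConsumer and shuttingDown are the surrounding MirrorMaker fields and the IllegalStateException body is inferred from this comment (a sketch, not the actual diff):

// Skip the "no data" check while a send failure is already tearing the mirror maker down.
if (!exitingOnSendFailure && !mirrorMakerConsumer.hasData && !shuttingDown)
  throw new IllegalStateException("Mirror maker consumer returned no data while not shutting down")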

Contributor Author

A send failure closes the producer, not the consumer, meaning mirrorMakerConsumer.hasData will return true and this won't be called. It's possible for there to be both a send failure and a consumer death at the same time, in which case which one kills the MM is semantics :-). Will update.

@Override
public void uncaughtException(Thread t, Throwable e) {
    try {
        log.error("thread " + t.getName() + " died to uncaught exception", e);
Contributor

Capitalize 't'
Also, died due to

Contributor Author

@radai-rosenblatt radai-rosenblatt Sep 29, 2016

fixed

        run(time.milliseconds());
    } catch (Exception e) {
        log.error("Uncaught error in kafka producer I/O thread: ", e);
        boolean gracefullShutdown = false;
Contributor

graceful

Contributor Author

@radai-rosenblatt radai-rosenblatt Sep 29, 2016

fixed

@@ -588,7 +588,10 @@ private[kafka] class ZookeeperConsumerConnector(val config: ConsumerConfig,
          if (doRebalance)
            syncedRebalance
        } catch {
          case t: Throwable => error("error during syncedRebalance", t)
          case t: Throwable => {
            error("error during syncedRebalance", t)
Contributor

Error

Contributor Author

fixed

          case t: Throwable => error("error during syncedRebalance", t)
          case t: Throwable => {
            error("error during syncedRebalance", t)
            shutdown()
Contributor

This is a somewhat significant change in the old consumer. Previously the consumer would actually stay alive after a series of rebalance failures and could recover on a subsequent successful rebalance trigger. So, for example, with this change one consumer that has high-latency issues with ZooKeeper and is slow to release partition ownership could cause all the other (normal) consumers to shut down.

Even in the new consumer, a consumer that is problematic within the group is not shut down. It is marked as failed, but it can definitely rejoin the group.

Contributor Author

@radai-rosenblatt radai-rosenblatt Sep 29, 2016

The problem is you're not guaranteed when the next rebalance will happen, if at all.
So here's a non-hypothetical scenario: a consumer has failed to rebalance, but remains "alive" (it's still registered in ZK). The rest of the consumers rebalance successfully and leave a chunk of partitions to the failed consumer (because it's still alive in ZK).

You end up with orphaned partitions for an unbounded amount of time.

The choices I can see are:

  1. Suicide if you cannot rebalance. Your death will trigger a further rebalance and all partitions should be distributed among the survivors (see the sketch after this list).
  2. Trigger another rebalance. The issue is that you got the exception because the user defined a fixed max number of attempts; you'd be violating that setting, as this is effectively indefinite retries.
  3. At least release all partitions and unregister yourself from ZK. This is almost like option 1, only you remain in some interesting zombie state?!
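
A rough sketch of option 1, with hypothetical names (rebalanceOrDie) and the existing rebalance/shutdown logic treated as opaque callbacks:

// Give up and shut down once the configured number of rebalance attempts is exhausted.
def rebalanceOrDie(maxAttempts: Int)(tryRebalance: () => Boolean)(shutdown: () => Unit): Unit = {
  val succeeded = (1 to maxAttempts).exists { attempt =>
    try tryRebalance()
    catch {
      case t: Throwable =>
        System.err.println(s"Rebalance attempt $attempt of $maxAttempts failed: $t")
        false
    }
  }
  // Shutting down releases the ZK registration, which triggers a further
  // rebalance so the surviving consumers pick up the orphaned partitions.
  if (!succeeded)
    shutdown()
}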

Contributor

I agree that it is a weird state to be in and shutting down is not unreasonable. However, this is a major change in behavior, especially given the decentralized coordination in the old consumer. The main issue that I see is that one misbehaving client that is slow to release partitions can cause the good clients to shut down.

Specifically at LinkedIn, one common occurrence of rebalance failures is when a developer starts up a consumer on a development machine that joins a group in a staging environment (i.e., in a remote DC). Sure, developers should use a different group, but for various reasons this happens. This actually causes rebalance failures in the staging instances. If those instances were to shut down, the staging services would need to be bounced (and some of them may even die depending on how they handle the consumer's lifecycle).

So it seems at the very least we should make such a change configurable and default to the old behavior.
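
For illustration, a sketch of how the configurable variant might look; the property name "rebalance.shutdown.on.failure" is made up here and not an actual Kafka config:

import java.util.Properties

object RebalanceFailurePolicy {
  // Hypothetical flag; off by default so existing deployments keep the old keep-alive behavior.
  def shouldShutdownOnFailure(props: Properties): Boolean =
    props.getProperty("rebalance.shutdown.on.failure", "false").toBoolean

  // Mirrors the catch block under review: only shut down when explicitly enabled.
  def handleRebalanceFailure(props: Properties, t: Throwable, shutdown: () => Unit): Unit = {
    System.err.println(s"error during syncedRebalance: $t")
    if (shouldShutdownOnFailure(props))
      shutdown()
    // otherwise keep the old behavior: stay registered and wait for the next rebalance trigger
  }
}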

Contributor Author

will do.

@radai-rosenblatt radai-rosenblatt force-pushed the honorable-death branch 2 times, most recently from 382e567 to 459ad81 on September 29, 2016 19:40
@@ -588,7 +588,10 @@ private[kafka] class ZookeeperConsumerConnector(val config: ConsumerConfig,
          if (doRebalance)
            syncedRebalance
        } catch {
          case t: Throwable => error("error during syncedRebalance", t)
          case t: Throwable => {
Member

Nitpick: you don't need braces in case blocks in Scala.

Contributor Author

fixed

@ijuma
Member

ijuma commented Sep 30, 2016

cc @junrao @hachikuji

…r shutdown on failure to rebalance, and make MM die on any of the above.

Signed-off-by: radai-rosenblatt <[email protected]>