KAFKA-9615: Clean up task/producer create and close #8213

vvcephei · 2020-03-03T22:16:50Z

Consolidates task/producer management.

Now, exactly one component manages the creation and destruction of Producers,
whether they are per-thread or per-task.
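
For readers skimming the thread, here is a minimal sketch of the ownership rule described above. It is not the code in this PR: `CloseableProducer`, `ProducerOwner`, and the method names are hypothetical stand-ins, and task ids are plain strings. The point is only that one class creates every producer and is the only place that closes them, in both the per-thread and the per-task mode.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a Kafka Producer; only close() matters for this sketch.
interface CloseableProducer {
    void close();
}

// Hypothetical single owner: the only class that creates or closes producers.
final class ProducerOwner {
    private final boolean perTask;
    private final Supplier<CloseableProducer> factory;
    private final CloseableProducer threadProducer;
    private final Map<String, CloseableProducer> taskProducers = new HashMap<>();

    ProducerOwner(final boolean perTask, final Supplier<CloseableProducer> factory) {
        this.perTask = perTask;
        this.factory = factory;
        // In per-thread mode one shared producer is created up front.
        this.threadProducer = perTask ? null : factory.get();
    }

    // Tasks borrow a producer from the owner; they never close it themselves.
    CloseableProducer producerFor(final String taskId) {
        if (!perTask) {
            return threadProducer;
        }
        return taskProducers.computeIfAbsent(taskId, id -> factory.get());
    }

    // Closes and forgets the producer of one task; a no-op if there is none.
    void closeProducerForTask(final String taskId) {
        final CloseableProducer producer = taskProducers.remove(taskId);
        if (producer != null) {
            producer.close();
        }
    }

    // Closes the shared producer; a no-op in per-task mode.
    void closeThreadProducerIfNeeded() {
        if (threadProducer != null) {
            threadProducer.close();
        }
    }
}
```

With this shape, callers that only hold a borrowed producer have no way to close it by accident, which is the property the refactor is after.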

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

vvcephei

@guozhangwang, do you mind taking a look at this refactor?

vvcephei · 2020-03-03T22:17:54Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+
+import static org.apache.kafka.streams.StreamsConfig.EXACTLY_ONCE;
+
+class ActiveTaskCreator {


Pulled out of StreamThread (could have been done a long time ago, since it was a static class anyway). I didn't embed it in TaskManager, just to keep the file size lower.

I also dropped AbstractTaskCreator, since the creation of Active and Standby tasks is only similar, not exactly the same. We weren't really using the abstraction for much except de-duplicating a few field declarations. On the con side, the abstraction made it hard to see that we were requiring several arguments for Standby task creation that were never actually used. The indirection also made it harder to read the task creation logic.

Sounds good to me

vvcephei · 2020-03-03T22:22:57Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/RecordCollector.java

@@ -25,7 +25,7 @@

 import java.util.Map;

-public interface RecordCollector extends AutoCloseable {


We never used it as an AutoCloseable, and having it makes it hard to trace the callers.

vvcephei · 2020-03-03T22:23:41Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/RecordCollectorImpl.java

@@ -262,7 +262,7 @@ public void close() {
        if (eosEnabled) {
            streamsProducer.abortTransaction();
        }
-        streamsProducer.close();


Only the creator may close the producer, but we can go ahead and call flush() here.

vvcephei · 2020-03-03T22:24:50Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java

@@ -544,11 +306,7 @@ public static StreamThread create(final InternalTopologyBuilder builder,

        final ThreadCache cache = new ThreadCache(logContext, cacheSizeBytes, streamsMetrics);

-        final Map<TaskId, Producer<byte[], byte[]>> taskProducers = new HashMap<>();


Managed fully inside the task factory now.

You'll also notice a bunch of references to the producers are similarly gone in the following lines.

vvcephei · 2020-03-03T22:26:29Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java

-            threadProducer == null ?
-                Collections.emptySet() :
-                Collections.singleton(getThreadProducerClientId(this.getName())),


Sooo many switch statements are now gone.

vvcephei · 2020-03-03T22:27:22Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java

-            }
-        }
-        return result;
+        return taskManager.producerMetrics();


We don't manage the producers anymore, so we can defer to the taskManager (which will defer to the active task creator, but that's none of the thread's business).

vvcephei · 2020-03-03T22:29:33Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsProducer.java

+    public StreamsProducer(final Producer<byte[], byte[]> producer,
+                           final boolean eosEnabled,
+                           final LogContext logContext,
+                           final String applicationId) {


This is a bit on the side, but there was some wacky stuff going on in here, and afaict, the only purpose of all the nullable fields was to allow including the task id in exception messages. Do we really need to do that? If so, I'll just add the logContext to the exception message instead, since it already has the task id in it.

Actually the logContext.logPrefix() should have the format of stream-thread [%s] task [%s] already, so all the log4j entries are good. For exception messages, we can just get the prefix and then encode that into the exception message.

Yep, that's what I was thinking. I'll go ahead and do it.
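
A rough sketch of that idea, under stated assumptions: `SendFailedException` and `PrefixedErrors` are hypothetical names (the real code would use the Streams exception types), and the prefix is passed in as a plain string that is assumed to already end with a space, in the `stream-thread [%s] task [%s] ` shape mentioned above.

```java
// Hypothetical exception type standing in for the real Streams exception.
final class SendFailedException extends RuntimeException {
    SendFailedException(final String message, final Throwable cause) {
        super(message, cause);
    }
}

final class PrefixedErrors {
    // logPrefix is assumed to be what the log context would return,
    // e.g. "stream-thread [t1] task [0_0] " (with a trailing space).
    static SendFailedException sendFailed(final String logPrefix,
                                          final String topic,
                                          final Throwable cause) {
        // The prefix carries the task id, so no nullable taskId field is needed.
        // The cause is attached rather than formatted into the message.
        return new SendFailedException(
            String.format("%sError encountered sending record to topic %s", logPrefix, topic),
            cause
        );
    }
}
```

The message then identifies the thread and task without the producer wrapper having to know the task id itself.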

vvcephei · 2020-03-03T22:30:33Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsProducer.java

-                    "Error encountered sending record to topic %s%s due to:%n%s",
-                    record.topic(),
-                    taskId == null ? "" : " " + logMessage,
-                    uncaughtException.toString());


no need to add the exception to the message, since we also pass it as the cause below.

vvcephei · 2020-03-03T22:31:18Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsProducer.java

@@ -212,17 +190,6 @@ public void flush() {
        producer.flush();
    }

-    public void close() {


No one who has a reference to the StreamsProducer has any business calling close, so the method is gone now.

vvcephei · 2020-03-03T22:33:21Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java

@@ -227,7 +227,7 @@ public void handleAssignment(final Map<TaskId, Set<TopicPartition>> activeTasks,
        }

        if (!standbyTasksToCreate.isEmpty()) {
-            standbyTaskCreator.createTasks(mainConsumer, standbyTasksToCreate).forEach(this::addNewTask);
+            standbyTaskCreator.createTasks(standbyTasksToCreate).forEach(this::addNewTask);


Standby tasks don't need the mainConsumer (obvious, in retrospect).

guozhangwang

I made a pass over the PR.

guozhangwang · 2020-03-03T23:16:05Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java

-    // Instead, we should register and record the metrics properly inside of the record collector.
-    Map<TaskId, StreamTask> fixmeStreamTasks() {
-        return tasks.values().stream().filter(t -> t instanceof StreamTask).map(t -> (StreamTask) t).collect(Collectors.toMap(Task::id, t -> t));
+    Map<MetricName, Metric> producerMetrics() {


nit: These two functions are not for testing only.

Thanks for the feedback; I didn't understand this particular comment, though.

The comment line above these two method declarations says the following functions are for testing only, but these two functions are not.

Ah, I just found what you were talking about:

// below are for testing only

I didn't notice that up there.

guozhangwang · 2020-03-03T23:21:56Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java

-                final StreamThread.AbstractTaskCreator<? extends Task> activeTaskCreator,
-                final StreamThread.AbstractTaskCreator<? extends Task> standbyTaskCreator,
-                final Map<TaskId, Producer<byte[], byte[]>> taskProducers,
+                final ActiveTaskCreator activeTaskCreator,


In shutdown(final boolean clean) we should also release the task producers, right?

Yep. Good catch.

guozhangwang · 2020-03-03T23:24:12Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java

-            threadProducer == null ?
-                Collections.emptySet() :
-                Collections.singleton(getThreadProducerClientId(this.getName())),


guozhangwang · 2020-03-03T23:28:14Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsProducer.java

+    public StreamsProducer(final Producer<byte[], byte[]> producer,
+                           final boolean eosEnabled,
+                           final LogContext logContext,
+                           final String applicationId) {


Actually the logContext.logPrefix() should have the format of stream-thread [%s] task [%s] already, so all the log4j entries are good. For exception messages, we can just get the prefix and then encode that into the exception message.

guozhangwang · 2020-03-03T23:42:26Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/RecordCollectorImpl.java

@@ -262,7 +262,7 @@ public void close() {
        if (eosEnabled) {
            streamsProducer.abortTransaction();
        }
-        streamsProducer.close();
+        streamsProducer.flush();


I'm wondering if we are introducing a latency regression without EOS here: in the old code, when closing without EOS we actually did nothing, and now we would block on flushing.

On the other hand, flushing may be needed when we close a task, to make sure all the task's records are acked already.

If the task is in RUNNING before shutting down, we would always commit before closing, so flush is already called; if the task is in RESTORING / SUSPENDED there's nothing written from this task, so a flush is not needed. So I think it is safe to not call flush after all.

Sounds reasonable to me

Ok, I'll swap this out.

guozhangwang · 2020-03-03T23:52:19Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/StreamsProducerTest.java

@@ -612,25 +582,4 @@ public void shouldFailOnEosAbortTxFatal() {

        assertThat(thrown.getMessage(), equalTo("KABOOM!"));
    }
-
-    @Test
-    public void shouldFailOnCloseFatal() {


We did not create new test classes for ActiveTaskCreator / StandbyTaskCreator, but we should still have that coverage to make sure the exceptions thrown from producers are wrapped correctly.

Also, it seems that in the new code we no longer rethrow -- is that intentional? I left a comment above.

Good catch. Actually, there's a bunch of coverage missing from TaskManager. I'll add several more tests.

guozhangwang · 2020-03-03T23:53:38Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+            try {
+                threadProducer.close();
+            } catch (final RuntimeException e) {
+                log.error("Failed to close producer due to the following error:", e);


In the old code we re-throw exceptions, whereas here we just swallow the error. Is that intentional?

Ah, thanks for noticing this. I meant to ask about it in here.

I thought it was strange that we would only re-throw when closing the task producers, not the thread producer. It seems like we should do the same thing in both cases, but which thing should we do?

I went with an error log in both cases, but it sounds like you wanted to throw the exception instead. Should we also rethrow for the thread producer?

Yeah, I think we should make them consistent: previously, closing the producer happened within task#close, and if that is called via closeDirty we should make sure it never throws. Now that it is extracted out of the close call, we should just rethrow in both cases.

guozhangwang · 2020-03-03T23:56:01Z

streams/src/test/java/org/apache/kafka/streams/state/KeyValueStoreTestDriver.java

@@ -201,7 +201,7 @@ private KeyValueStoreTestDriver(final StateSerdes<K, V> serdes) {
            logContext,
            new TaskId(0, 0),
            consumer,
-            new StreamsProducer(logContext, producer),
+            new StreamsProducer(producer, null != null, logContext, null),


null != null?

Oops! Auto-refactoring. I already made two passes to clean these up, looks like I missed one.

mjsax

Nice cleanup. There is quite some overlap with #8215, even though the two PRs address different issues.

mjsax · 2020-03-04T00:05:01Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+        return createdTasks;
+    }
+
+    public void releaseProducer() {


closeThreadProducer()

The method below is not public -- does this one need to be public?

even better maybeCloseThreadProducer

mjsax · 2020-03-04T00:05:35Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+        }
+    }
+
+    void releaseProducer(final TaskId id) {


closeProducerForTask(TaskId) ?

mjsax · 2020-03-04T00:06:26Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+        }
+    }
+
+    public InternalTopologyBuilder builder() {


Does this need to be public?

Hah, doesn't need to be there at all, actually.

mjsax · 2020-03-04T00:06:30Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+        return builder;
+    }
+
+    public StateDirectory stateDirectory() {


Does this need to be public?

mjsax · 2020-03-04T00:09:53Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/RecordCollectorImpl.java

@@ -262,7 +262,7 @@ public void close() {
        if (eosEnabled) {
            streamsProducer.abortTransaction();
        }
-        streamsProducer.close();
+        streamsProducer.flush();


Sounds reasonable to me

abbccdda · 2020-03-04T01:23:17Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+
+            if (threadProducer == null) {
+                // create one producer per task for EOS
+                // TODO: after KIP-447 this would be removed


I don't think we need to keep this TODO, since we will only remove support for the task producer once there is a Streams 3.0 release.

abbccdda · 2020-03-04T01:24:30Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+                log.error("Failed to close producer due to the following error:", e);
+            }
+        }
+        if (!taskProducers.isEmpty()) {


Should we throw illegal state first, since we are already in an error state?

If we're going to rename the method to specify that it should do exactly "close thread producer", then this check is no longer appropriate.

abbccdda · 2020-03-04T01:26:23Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java

+        return createdTasks;
+    }
+
+    public void releaseProducer() {


even better maybeCloseThreadProducer

vvcephei · 2020-03-04T23:44:33Z

Hey @guozhangwang, @mjsax, and @abbccdda,

I've addressed all the feedback. In particular I added about 500 lines of missing tests to TaskManagerTest. It looks like we've done a fair amount of un-unit-tested refactoring in TaskManagerTest, so I figured I'd go ahead and add the missing coverage while I was adding coverage for this refactor itself. Be forewarned, the tests are... extensive.

guozhangwang

@vvcephei I made a pass over the testing code, and only have a couple minor comments.

After addressing them please feel free to merge.

guozhangwang · 2020-03-04T23:53:03Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java

+                try {
+                    activeTaskCreator.closeAndRemoveTaskProducerIfNeeded(task.id());
+                } catch (final RuntimeException e) {
+                    log.debug("Error handling lostAll", e);


Let's make it a warn instead of a debug.

The error message can be more specific here: Error closing task producer for task {} while handling lostAll.

guozhangwang · 2020-03-04T23:53:32Z

streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java

+                    if (clean) {
+                        firstException.compareAndSet(null, e);
+                    } else {
+                        log.warn("Ignoring an exception while closing task producer.", e);


Ditto here about error message
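
For context, a minimal sketch of the clean/unclean pattern in the hunk above, not the actual TaskManager code. `ShutdownSketch` and `closeAll` are hypothetical names, closers are modeled as plain `Runnable`s, logging goes to stderr, and rethrowing the first exception after the loop is an assumption about what happens with `firstException` later in the method.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class ShutdownSketch {
    // Attempts every close. On a clean shutdown the first failure is remembered
    // and rethrown once all closes were attempted; on an unclean shutdown
    // failures are only logged, so shutdown always runs to the end.
    static void closeAll(final List<Runnable> closers, final boolean clean) {
        final AtomicReference<RuntimeException> firstException = new AtomicReference<>(null);
        for (final Runnable closer : closers) {
            try {
                closer.run();
            } catch (final RuntimeException e) {
                if (clean) {
                    // Only the first failure is kept; later ones do not overwrite it.
                    firstException.compareAndSet(null, e);
                } else {
                    System.err.println("Ignoring an exception while closing task producer: " + e);
                }
            }
        }
        final RuntimeException first = firstException.get();
        if (first != null) {
            throw first;
        }
    }
}
```

Either way, one failing producer does not prevent the remaining ones from being closed.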

guozhangwang · 2020-03-04T23:55:43Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/TaskManagerTest.java

@@ -440,6 +872,41 @@ public void shouldCommitActiveAndStandbyTasks() {
        assertThat(taskManager.commitAll(), equalTo(2));
    }

+    @Test
+    public void shouldNotCommitActiveAndStandbyTasks() {


nit: ...WhileRebalanceInProgress

abbccdda · 2020-03-05T17:23:05Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/StreamsProducerTest.java

-        assertTrue(eosMockProducer.history().isEmpty());
-        assertThat(eosMockProducer.uncommittedRecords().size(), equalTo(1));
-        assertThat(eosMockProducer.uncommittedRecords().get(0), equalTo(record));
+        assertThat(eosMockProducer.transactionInFlight(), is(true));


I was wondering what's the standard for using assertThat vs assertTrue? Do we have a convention to follow?

assertThat is nicer in general, but it doesn't really matter. In this case, IDEA offered to translate, and I was already changing a lot of assertions, so I just accepted the translation.

abbccdda · 2020-03-05T17:29:08Z

streams/src/test/java/org/apache/kafka/streams/processor/internals/TaskManagerTest.java

@@ -24,14 +24,20 @@
 import org.apache.kafka.clients.consumer.Consumer;
 import org.apache.kafka.clients.consumer.ConsumerRecord;
 import org.apache.kafka.common.KafkaException;
+import org.apache.kafka.common.Metric;


Are the tests added in TaskManager only trying for more coverage? @vvcephei

yep, that's right.

abbccdda

LGTM, only have some minor comments about tests. After addressing them, feel free to merge.

vvcephei · 2020-03-05T20:19:13Z

The last commit was trivial, and the Streams tests passed locally for me, so I'm going to go ahead and merge.

John Roesler added 5 commits March 2, 2020 16:19

wip

578c68f

step 1: pull out task creators

5a10136

step 2: encapsulate producers

5aaf9da

step 3: clean up StreamsProducer

0a8c8fe

step 4: clean up abstract task creator

b63153a

vvcephei added the streams label Mar 3, 2020

vvcephei requested a review from guozhangwang March 3, 2020 22:16

final cleanup

f53ef6d

vvcephei commented Mar 3, 2020

guozhangwang reviewed Mar 3, 2020

mjsax reviewed Mar 4, 2020

abbccdda reviewed Mar 4, 2020

John Roesler added 6 commits March 4, 2020 11:37

format exception messages in StreamsProducer

dd7dce0

misc cr

977893a

fixed missing coverage in TaskManagerTest

46a0baa

fix style

3d51081

repair tests

7405aaf

catch producer close exceptions during unclean shutdown

26205cb

suppress complexity

2c4a2bc

guozhangwang approved these changes Mar 4, 2020

minor final CR feedback

2eb596b

abbccdda reviewed Mar 5, 2020

abbccdda approved these changes Mar 5, 2020

vvcephei merged commit 78374a1 into apache:trunk Mar 5, 2020

vvcephei deleted the KAFKA-9615-cleanup-task-producers-2 branch March 5, 2020 20:20

