
Add topic name as a column in the Kafka Input format #14857

Merged: 2 commits into apache:master on Aug 21, 2023

Conversation

@abhishekagarwal87 (Contributor) commented Aug 17, 2023

This PR adds a way to store the topic name in a column. Such a column can be used to distinguish messages coming from different topics in multi-topic ingestion.

Release notes
You can now optionally ingest the name of the Kafka topic into the datasource. This is particularly helpful when a datasource receives data from multiple Kafka topics.
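
For illustration, a minimal sketch of what a Kafka input format spec with the new column might look like. The property name topicColumnName and the default column name kafka.topic are assumptions drawn from this discussion; check the Kafka input format documentation for the exact names. valueFormat wraps the format of the record payload.

```json
{
  "type": "kafka",
  "valueFormat": { "type": "json" },
  "topicColumnName": "kafka.topic"
}
```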

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87 added the Area - Streaming Ingestion, Needs web console change, and Release Notes labels on Aug 17, 2023
@@ -683,7 +700,8 @@ public void testMissingTimestampThrowsException() throws IOException
while (iterator.hasNext()) {
Throwable t = Assert.assertThrows(ParseException.class, () -> iterator.next());
Assert.assertEquals(
"Timestamp[null] is unparseable! Event: {foo=x, kafka.newts.timestamp=1624492800000, kafka.newkey.key=sampleKey, root_baz=4, bar=null, kafka...",
"Timestamp[null] is unparseable! Event: {kafka.newtopic.topic=sample, foo=x, kafka.newts"
Contributor:

Nit: Maybe break the line right before kafka.newts.timestamp or before Event:

@@ -59,6 +59,7 @@ public class KafkaInputFormatTest
{
private KafkaRecordEntity inputEntity;
private final long timestamp = DateTimes.of("2021-06-24").getMillis();
private final String TOPIC = "sample";
Contributor:

Suggested change
private final String TOPIC = "sample";
private static final String TOPIC = "sample";

Comment on lines 134 to 135
// Add kafka record topic to the mergelist, we will skip record topic if the same key exists already in
// the header list
Contributor:

Nit:

Suggested change
// Add kafka record topic to the mergelist, we will skip record topic if the same key exists already in
// the header list
// Add kafka record topic to the mergelist, only if the key doesn't already exist

@kfaraz (Contributor) left a comment


This should be very helpful for debugging! Left some minor comments.

Edit: For ingestion from a single topic, this column would contain redundant data, leading to unnecessary storage costs. How can we avoid populating it in such cases while still populating the other metadata columns such as kafka.header.x and kafka.timestamp?

We should also include a release note in the PR description and update the docs.

@vogievetsky (Contributor):

Does it make sense to update the API docs in this PR also?

@abhishekagarwal87 (Contributor, Author):

> This should be very helpful for debugging! Left some minor comments.
>
> Edit: For ingestion from a single topic, this column would contain redundant data, leading to unnecessary storage costs. How can we avoid populating it in such cases while still populating the other metadata columns such as kafka.header.x and kafka.timestamp?
>
> We should also include a release note in the PR description and update the docs.

@kfaraz - I suppose you just wouldn't add the column to your dimension spec. It's just like the input format exposing 11 columns instead of 10; you can choose to ingest only 10 of those. @vogievetsky - maybe we can make the topic column optional in the console, or detect in the console whether topicPattern is set and only then ask for the Kafka topic column name?
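
As a sketch of that idea, with explicitly listed dimensions (the column names below are illustrative), the topic column is left out simply by not listing it:

```json
{
  "dimensionsSpec": {
    "dimensions": ["foo", "bar", "kafka.timestamp"]
  }
}
```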

@abhishekagarwal87 (Contributor, Author):

@vogievetsky - I will make the docs changes in a separate PR.

@abhishekagarwal87 (Contributor, Author):

@kfaraz - one thing I can do is to keep the default kafka topic name as null. So unless explicitly specified, this column will not be populated. What do you think?

@kfaraz (Contributor) commented Aug 21, 2023

> the default kafka topic name as null

Do you mean the DEFAULT_TOPIC_COLUMN_NAME would be null and we check for null before adding it to the merged header? If yes, then I think that makes sense.

Also, I guess what you had previously suggested wouldn't work, right? Because we seem to be adding the metadata columns even if they are not specified in the dimensionsSpec.

> you just wouldn't add the column to your dimension spec. it's just like input format exposing 11 columns instead of 10 but you can choose to ingest only 10 of those.

@abhishekagarwal87 (Contributor, Author):

@kfaraz - I read it a bit more. These fields are populated only if you set the input format to kafka. If you leave the input format the same as the underlying data, e.g. avro, these metadata fields are not populated. If you set the input format to kafka and still don't want to ingest this column, there are the following options:

  • add kafka.topic to the dimension exclusion list
  • disable auto-discovery by setting useSchemaDiscovery and includeAllDimensions to false, and don't add the column to the ingestion spec

Given the above, I think it's okay not to add special handling for the topic column.
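
A sketch of the first option, assuming the default column name kafka.topic and a dimensionsSpec that relies on schema discovery:

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["kafka.topic"]
  }
}
```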

@kfaraz (Contributor) left a comment


LGTM 🚀

@abhishekagarwal87 merged commit a38b4f0 into apache:master on Aug 21, 2023
46 checks passed
@LakshSingla added this to the 28.0 milestone on Oct 12, 2023
@vogievetsky removed the Needs web console change label on Dec 14, 2023