
Add topic name as a column in the Kafka Input format #14857

Merged: 2 commits into apache:master on Aug 21, 2023

Conversation

@abhishekagarwal87 (Contributor) commented Aug 17, 2023

This PR adds a way to store the topic name in a column. Such a column can be used to distinguish messages coming from different topics in multi-topic ingestion.

Release notes
You can now optionally ingest the name of the Kafka topic into the datasource. This is particularly helpful when a datasource receives data from multiple Kafka topics.
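
For illustration, a minimal sketch of what a Kafka input format spec with the new column might look like. The property name topicColumnName and the default column name kafka.topic are assumptions drawn from this discussion; check the Kafka input format documentation for the exact names. valueFormat wraps the format of the record payload.

```json
{
  "type": "kafka",
  "valueFormat": { "type": "json" },
  "topicColumnName": "kafka.topic"
}
```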

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87 added the Area - Streaming Ingestion, Needs web console change, and Release Notes labels on Aug 17, 2023
@@ -683,7 +700,8 @@ public void testMissingTimestampThrowsException() throws IOException
while (iterator.hasNext()) {
Throwable t = Assert.assertThrows(ParseException.class, () -> iterator.next());
Assert.assertEquals(
"Timestamp[null] is unparseable! Event: {foo=x, kafka.newts.timestamp=1624492800000, kafka.newkey.key=sampleKey, root_baz=4, bar=null, kafka...",
"Timestamp[null] is unparseable! Event: {kafka.newtopic.topic=sample, foo=x, kafka.newts"
Contributor:

Nit: Maybe break the line right before kafka.newts.timestamp or before Event:

@@ -59,6 +59,7 @@ public class KafkaInputFormatTest
{
private KafkaRecordEntity inputEntity;
private final long timestamp = DateTimes.of("2021-06-24").getMillis();
private final String TOPIC = "sample";
Contributor:

Suggested change
private final String TOPIC = "sample";
private static final String TOPIC = "sample";

Comment on lines 134 to 135
// Add kafka record topic to the mergelist, we will skip record topic if the same key exists already in
// the header list
Contributor:

Nit:

Suggested change
// Add kafka record topic to the mergelist, we will skip record topic if the same key exists already in
// the header list
// Add kafka record topic to the mergelist, only if the key doesn't already exist

@kfaraz (Contributor) left a comment


This should be very helpful for debugging! Left some minor comments.

Edit: For ingestion from a single topic, this column would contain redundant data, leading to unnecessary storage costs. How can we avoid populating it in such cases while still populating the other metadata columns such as kafka.header.x and kafka.timestamp?

We should also include a release note in the PR description and update the docs.

@vogievetsky (Contributor):

Does it make sense to update the API docs in this PR also?

@abhishekagarwal87 (Contributor, Author):

> This should be very helpful for debugging! Left some minor comments.
>
> Edit: For ingestion from a single topic, this column would contain redundant data, leading to unnecessary storage costs. How can we avoid populating it in such cases while still populating the other metadata columns such as kafka.header.x and kafka.timestamp?
>
> We should also include a release note in the PR description and update the docs.

@kfaraz - I suppose you just wouldn't add the column to your dimension spec. It's just like the input format exposing 11 columns instead of 10; you can choose to ingest only 10 of those. @vogievetsky - maybe we can make the topic column optional in the console, or detect in the console whether topicPattern is set and only then ask for the Kafka topic column name?
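
As a sketch of that idea, with explicitly listed dimensions (the column names below are illustrative), the topic column is left out simply by not listing it:

```json
{
  "dimensionsSpec": {
    "dimensions": ["foo", "bar", "kafka.timestamp"]
  }
}
```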

@abhishekagarwal87 (Contributor, Author):

@vogievetsky - I will make the docs changes in a separate PR.

@abhishekagarwal87 (Contributor, Author):

@kfaraz - one thing I can do is to keep the default kafka topic name as null. So unless explicitly specified, this column will not be populated. What do you think?

@kfaraz (Contributor) commented Aug 21, 2023

> the default kafka topic name as null

Do you mean the DEFAULT_TOPIC_COLUMN_NAME would be null and we check for null before adding it to the merged header? If yes, then I think that makes sense.

Also, I guess what you had previously suggested wouldn't work, right? Because we seem to be adding the metadata columns even if they are not specified in the dimensionsSpec.

> you just wouldn't add the column to your dimension spec. it's just like input format exposing 11 columns instead of 10 but you can choose to ingest only 10 of those.

@abhishekagarwal87 (Contributor, Author):

@kfaraz - I read it a bit more. These fields are populated only if you set the input format to kafka. If you leave the input format the same as the underlying data, e.g. avro, these metadata fields are not populated. If you set the input format to kafka and still don't want to ingest this column, there are the following options:

  • add kafka.topic to the dimension exclusion list
  • disable auto-discovery by setting useSchemaDiscovery and includeAllDimensions to false, and don't add the column to the ingestion spec

Given the above, I think it's okay not to add special handling for the topic column.
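
A sketch of the first option, assuming the default column name kafka.topic and a dimensionsSpec that relies on schema discovery:

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["kafka.topic"]
  }
}
```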

@kfaraz (Contributor) left a comment


LGTM 🚀

@abhishekagarwal87 merged commit a38b4f0 into apache:master on Aug 21, 2023
46 checks passed
@LakshSingla added this to the 28.0 milestone on Oct 12, 2023
@vogievetsky removed the Needs web console change label on Dec 14, 2023