Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tombstone ingestion with schematization #700

Merged
merged 22 commits into from
Sep 20, 2023
Merged

Conversation

sfc-gh-rcheng
Copy link
Collaborator

@sfc-gh-rcheng sfc-gh-rcheng commented Sep 11, 2023

  • Enables schematization tombstone ingestion
  • Adds comment about a bug in ignoring snowpipe tombstone records

Test changes

  • Adds IT for snowpipe, streaming and schematization ingestion methods
  • Alters existing json e2e tests to ingest tombstone records by default
  • Adds new e2e tests to ignore tombstone records
  • Special cases Kafka version 2.5.1 due to a bug with tombstone records. Will need to make a note of this in the documentation

@@ -152,7 +152,7 @@ static Map<String, String> getColumnTypes(SinkRecord record, List<String> column
private static Map<String, String> getSchemaMapFromRecord(SinkRecord record) {
Map<String, String> schemaMap = new HashMap<>();
Schema schema = record.valueSchema();
if (schema != null) {
if (schema != null && schema.fields() != null) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added additional null check to ensure we don't NPE in the next for loop


// return empty if tombstone record
if (node.size() == 0
&& this.behaviorOnNullValues == SnowflakeSinkConnectorConfig.BehaviorOnNullValues.DEFAULT) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crux of the schematization tombstone change

"snowflake.database.name":"SNOWFLAKE_DATABASE",
"snowflake.schema.name":"SNOWFLAKE_SCHEMA",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using community JsonConverter instead of Snowflake customer converter due to this bug

@sfc-gh-rcheng sfc-gh-rcheng changed the title Rcheng tbschema Fix tombstone ingestion with schematization Sep 14, 2023
@sfc-gh-rcheng sfc-gh-rcheng marked this pull request as ready for review September 14, 2023 22:22
@@ -31,10 +31,19 @@ def send(self):
print("Sending in Partition:" + str(p))
key = []
value = []
for e in range(self.recordNum):
for e in range(self.recordNum - 1):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we change this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are sending the tombstone record as the last record. The schema evolution tests rely on exactly recordNum rows being sent (throws NonRetryableError). Additionally a majority of the tests rely on row count as a buffer threshold, so sending the tombstone record as an additional row would increase test runtime

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks.
why do we have to make changes in normal snowpipe streaming tests? we can do this change in tombstone specific tests right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i could add a tombstone specific test, but i didn't want to extend the e2e runtime. plus tombstone ingestion is enabled by default, so i think it makes sense to put it in the normal e2e tests

happy to copy this into a new test if you prefer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, thanks for explanation, i missed it is default enabled. please add a comment wherever possible. Thanks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point it is confusing, added comments to the tests

@sfc-gh-japatel
Copy link
Collaborator

Can you add a description on what bug is fixed?

@sfc-gh-rcheng
Copy link
Collaborator Author

Can you add a description on what bug is fixed?

Updated the description. The fixed bug is to enable tombstone ingestion with schematization

Copy link
Collaborator

@sfc-gh-japatel sfc-gh-japatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!
Left a suggestion for adding comment about tombstone since it wont be obvious by reading test code why we expect one less record. But thanks for adding all those tests. Good stuff.
Lets also get an approval from @sfc-gh-tzhang related to schematization

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@@ -275,6 +275,13 @@ public Map<String, Object> getProcessedRecordForStreamingIngest(SinkRecord recor
private Map<String, Object> getMapFromJsonNodeForStreamingIngest(JsonNode node)
throws JsonProcessingException {
final Map<String, Object> streamingIngestRow = new HashMap<>();

// return empty if tombstone record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably has tested this, but all columns will be NULL for that row?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup tombstone records only have a RECORD_METADATA column and the rest are all null. The key is in the RECORD_METADATA
image

@sfc-gh-tzhang
Copy link
Contributor

  • Adds comment about a bug in ignoring snowpipe tombstone records

I think we want to fix this as well

@sfc-gh-rcheng
Copy link
Collaborator Author

  • Adds comment about a bug in ignoring snowpipe tombstone records

I think we want to fix this as well

I'll do this in a followup PR, I want to reduce existing tombstone behavior changes

@sfc-gh-rcheng sfc-gh-rcheng merged commit f0e2db6 into master Sep 20, 2023
@sfc-gh-rcheng sfc-gh-rcheng deleted the rcheng-tbschema branch September 20, 2023 22:08
khsoneji pushed a commit to confluentinc/snowflake-kafka-connector that referenced this pull request Oct 12, 2023
khsoneji pushed a commit to confluentinc/snowflake-kafka-connector that referenced this pull request Oct 12, 2023
khsoneji pushed a commit to confluentinc/snowflake-kafka-connector that referenced this pull request Oct 12, 2023
khsoneji pushed a commit to confluentinc/snowflake-kafka-connector that referenced this pull request Oct 12, 2023
EduardHantig pushed a commit to streamkap-com/snowflake-kafka-connector that referenced this pull request Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants