KafkaV2SinkConnector #38973

xinlian12 · 2024-02-27T20:38:41Z

Feature request #38769

This PR adds V2 of the Kafka Cosmos DB Sink Connector.

Config

kafka.connect.cosmos.accountEndpoint -> No default. Cosmos DB Account Endpoint Uri
kafka.connect.cosmos.accountKey -> No default. Cosmos DB Account Key
kafka.connect.cosmos.useGatewayMode -> Default false. Flag to indicate whether to use gateway mode.
kafka.connect.cosmos.preferredRegionsList -> Default empty list. Preferred regions list to use for a multi-region Cosmos DB account. This is a comma-separated value (e.g., [East US, West US] or East US, West US); the provided preferred regions are used as a hint. You should use a Kafka cluster collocated with your Cosmos DB account and pass the Kafka cluster region as the preferred region. See the list of Azure regions for valid region names.
kafka.connect.cosmos.applicationName -> Default empty string. Will be added as the userAgent suffix.
kafka.connect.cosmos.sink.database.name -> No default. Cosmos DB database name.
kafka.connect.cosmos.sink.containers.topicMap -> No default. A comma-delimited list of Kafka topics mapped to Cosmos containers. For example: topic1#con1,topic2#con2.
kafka.connect.cosmos.sink.errors.tolerance -> Default None. Error tolerance level after exhausting all retries. None fails on error; All logs the error and continues.
kafka.connect.cosmos.sink.bulk.enabled -> Default true. Flag to indicate whether Cosmos DB bulk mode is enabled for Sink connector.
kafka.connect.cosmos.sink.bulk.maxConcurrentCosmosPartitions -> Default -1. Cosmos DB item write max concurrent Cosmos partitions. If not specified, it is determined from the number of the container's physical partitions, which assumes every batch contains data from all Cosmos physical partitions. If specified, it indicates the maximum number of Cosmos physical partitions each batch contains data from. This config can make bulk processing more efficient when the input data in each batch has been repartitioned so that each batch only writes to a limited number of Cosmos partitions. It usually only needs tuning for very large containers (with hundreds of physical partitions).
kafka.connect.cosmos.sink.bulk.initialBatchSize -> Default 1. Cosmos DB initial bulk micro batch size. A micro batch is flushed to the backend when the number of documents enqueued exceeds this size or the target payload size is met. The micro batch size is tuned automatically based on the throttling rate. Reduce this to keep the first few requests from consuming too many RUs.
kafka.connect.cosmos.sink.write.strategy -> Default ItemOverwrite. Cosmos DB item write strategy: ItemOverwrite (using upsert), ItemAppend (using create, ignoring pre-existing items, i.e., conflicts), ItemDelete (deletes based on id/pk of data frame), ItemDeleteIfNotModified (deletes based on id/pk of data frame if the etag hasn't changed since collecting id/pk), ItemOverwriteIfNotModified (using create if the etag is empty, update/replace with etag pre-condition otherwise; if the document was updated, the pre-condition failure is ignored).
kafka.connect.cosmos.sink.maxRetryCount -> Default 10. Cosmos DB max retry attempts on write failures for Sink connector. By default, the connector will retry on transient write errors for up to 10 times.
kafka.connect.cosmos.sink.id.strategy -> Default ProvidedInValueStrategy. A strategy used to populate the document with an id. Valid strategies are: TemplateStrategy, FullKeyStrategy, KafkaMetadataStrategy, ProvidedInKeyStrategy, ProvidedInValueStrategy. Configuration properties prefixed with id.strategy are passed through to the strategy. For example, when using id.strategy=TemplateStrategy, the property id.strategy.template is passed through to the template strategy and used to specify the template string to be used in constructing the id.
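
To make the config list concrete, here is a minimal sketch of a sink configuration assembled as a plain Java map, using only keys listed above. The endpoint, account key, database, topic, and container values are placeholders, and parseTopicMap is a hypothetical helper that illustrates the topic1#con1,topic2#con2 format; it is not the connector's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CosmosSinkConfigSketch {

    // Builds a minimal sink configuration from keys in the list above.
    // All values are placeholders chosen for illustration.
    static Map<String, String> minimalSinkConfig() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("kafka.connect.cosmos.accountEndpoint", "https://<account>.documents.azure.com:443/");
        props.put("kafka.connect.cosmos.accountKey", "<account-key>");
        props.put("kafka.connect.cosmos.sink.database.name", "exampleDb");
        props.put("kafka.connect.cosmos.sink.containers.topicMap", "topic1#con1,topic2#con2");
        props.put("kafka.connect.cosmos.sink.write.strategy", "ItemOverwrite");
        props.put("kafka.connect.cosmos.sink.bulk.enabled", "true");
        props.put("kafka.connect.cosmos.sink.maxRetryCount", "10");
        return props;
    }

    // Hypothetical helper: splits "topic1#con1,topic2#con2" into topic -> container.
    static Map<String, String> parseTopicMap(String topicMap) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : topicMap.split(",")) {
            String[] parts = entry.trim().split("#");
            // Each entry must be exactly "<topic>#<container>" with both sides non-empty.
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Invalid topic map entry: " + entry);
            }
            result.put(parts[0].trim(), parts[1].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> config = minimalSinkConfig();
        Map<String, String> topics =
            parseTopicMap(config.get("kafka.connect.cosmos.sink.containers.topicMap"));
        topics.forEach((topic, container) -> System.out.println(topic + " -> " + container));
    }
}
```

Options left out of the map fall back to the defaults listed above.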

azure-sdk · 2024-02-27T23:39:42Z

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure.cosmos.kafka:azure-cosmos-kafka-connect

xinlian12 · 2024-03-04T19:00:08Z

/azp run java - cosmos - tests

azure-pipelines · 2024-03-04T19:00:25Z

Azure Pipelines successfully started running 1 pipeline(s).

sdk/cosmos/azure-cosmos-kafka-connect/pom.xml

eng/versioning/external_dependencies.txt

sdk/cosmos/azure-cosmos-kafka-connect/doc/configuration-reference.md

...src/main/java/com/azure/cosmos/kafka/connect/implementation/KafkaCosmosExceptionsHelper.java

...nnect/src/main/java/com/azure/cosmos/kafka/connect/implementation/sink/CosmosSinkConfig.java

FabianMeiswinkel · 2024-03-05T03:23:04Z

.../src/main/java/com/azure/cosmos/kafka/connect/implementation/sink/KafkaCosmosWriterBase.java

+                            .getCosmosAsyncContainerAccessor()
+                            .getLinkWithoutTrailingSlash(container),
+                        null,
+                        new DocumentCollection())


Why new DocumentCollection() and not null? I think this param is meant to be used when the async cache has an instance believed to be stale?

Both work. The value passed here just means: if the cached value is not the same as the stale value, use the cached value directly. So null or new DocumentCollection() would have the same effect.

But null is cheaper (no new garbage instantiation).

will change in next PR
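
The "obsolete value" semantics discussed in this thread can be sketched as follows. This is a simplified, hypothetical cache written for illustration, not the SDK's actual async cache: the class name, the get signature, and the use of equals for the staleness check are all assumptions. It shows why passing null and passing a fresh placeholder instance behave the same: neither matches the cached entry, so the cached entry is returned.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a cache lookup that accepts a value the caller believes is stale.
public class ObsoleteValueCacheSketch<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    // Returns the cached value unless it equals obsoleteValue.
    // On a miss, or when the cached value matches obsoleteValue, the loader runs
    // and its result replaces the entry. Not atomic; kept simple for illustration.
    public V get(K key, V obsoleteValue, Supplier<V> loader) {
        V cached = cache.get(key);
        if (cached != null && !cached.equals(obsoleteValue)) {
            return cached;
        }
        V fresh = loader.get();
        cache.put(key, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        ObsoleteValueCacheSketch<String, String> cache = new ObsoleteValueCacheSketch<>();
        int[] loads = {0};
        Supplier<String> loader = () -> "v" + (++loads[0]);

        cache.get("container", null, loader);                    // miss: loads v1
        cache.get("container", null, loader);                    // null obsolete value: cached v1 returned
        cache.get("container", new String("unrelated"), loader); // non-matching placeholder: cached v1 returned
        cache.get("container", "v1", loader);                    // caller says v1 is stale: reloads
        System.out.println("loader calls: " + loads[0]);
    }
}
```

Under these assumptions a placeholder only costs an extra allocation, which matches the reviewer's point that null is the cheaper choice.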

...va/com/azure/cosmos/kafka/connect/implementation/sink/idstrategy/TemplateStrategyConfig.java

...rc/main/java/com/azure/cosmos/kafka/connect/implementation/source/CosmosChangeFeedModes.java

FabianMeiswinkel

LGTM

xinlian12 · 2024-03-12T21:17:47Z

/azp run java - cosmos - tests

azure-pipelines · 2024-03-12T21:18:00Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2024-03-12T21:45:51Z

/azp run java - cosmos - tests

azure-pipelines · 2024-03-12T21:46:05Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2024-03-12T22:04:13Z

/azp run java - cosmos - tests

azure-pipelines · 2024-03-12T22:04:28Z

Azure Pipelines successfully started running 1 pipeline(s).

This reverts commit 6c983ab.

* Revert "KafkaV2SinkConnector (#38973)" This reverts commit 6c983ab.
* Revert "UsingTestContainerForKafkaIntegrationTests (#38884)" This reverts commit 12bec49.
* Revert "KafkaV2SourceConnector (#38748)" This reverts commit 30835d9.
* revert one more change
* revert change
---------
Co-authored-by: annie-mac <[email protected]>

* add sink connector v2 implementation
---------
Co-authored-by: annie-mac <[email protected]>

annie-mac added 2 commits February 24, 2024 13:52

some change

98f3d08

add sink connector v2 implementation

8838edd

xinlian12 requested review from alzimmermsft, samvaity, g2vinay, JonathanGiles, chenrujun, Netyyyy, saragluna, moarychan, kushagraThapar, FabianMeiswinkel, kirankumarkolli, milismsft, aayush3011, simorenoh, jeet1995 and Pilchie as code owners February 27, 2024 20:38

github-actions bot added the Cosmos label Feb 27, 2024

annie-mac added 2 commits February 27, 2024 12:52

changes

1c7d9b5

update pom

ac1a96f

xinlian12 force-pushed the kafkaV2SinkConnector-2 branch from 20587ec to ac1a96f Compare February 27, 2024 23:16

annie-mac added 2 commits February 27, 2024 15:17

Merge branch 'main' into kafkaV2SinkConnector-2

c284b1f

change to use customized schedulers

b1bc7ee

pom file update

223d6d7

xinlian12 force-pushed the kafkaV2SinkConnector-2 branch from 83ff738 to 223d6d7 Compare March 1, 2024 01:56

annie-mac added 3 commits February 29, 2024 18:49

update pom file

a5116b2

fix

490d550

merge from main and resolve conflicts

32fc004

xinlian12 requested a review from mssfang as a code owner March 1, 2024 16:45