Resolve schemas in parallel #85
Conversation
I have seen examples from our apps where CPU usage and event throughput periodically drop. This appears to coincide with the Iglu cache expiration time.

I believe this happens because all schemas tend to expire at the same time and need to be re-fetched by iglu-scala-client. Currently, we traverse over schemas sequentially, so we need to wait for each success before fetching the next schema. For a pipeline using many schemas, this can be a long period of downtime (several seconds) as we pause for schema resolution.

This commit switches to resolving schemas in parallel, so the downtime pauses should be shorter.
@@ -61,7 +62,7 @@ object NonAtomicFields {
         // Remove whole schema family if there is no subversion left after filtering
         subVersions.nonEmpty
       }
-      .traverse { case (tabledEntity, subVersions) =>
+      .parTraverse { case (tabledEntity, subVersions) =>
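For context, a minimal self-contained sketch of the difference (illustrative only, not the loader code; `resolveSchema` is a made-up stub and the 500 ms latency is arbitrary): with `traverse` the lookups run one after another, while `parTraverse` runs them concurrently.

```scala
import cats.effect.{IO, IOApp}
import cats.syntax.all._
import scala.concurrent.duration._

object ParallelResolutionSketch extends IOApp.Simple {

  // Hypothetical stand-in for an Iglu schema lookup; the sleep models network latency.
  def resolveSchema(key: String): IO[String] =
    IO.sleep(500.millis).as(s"schema for $key")

  val keys = List("entity_a", "entity_b", "entity_c", "entity_d")

  val run: IO[Unit] =
    for {
      seq <- keys.traverse(resolveSchema).timed    // ~2s: each lookup waits for the previous one
      par <- keys.parTraverse(resolveSchema).timed // ~0.5s: lookups run concurrently
      _   <- IO.println(s"sequential: ${seq._1.toMillis} ms, parallel: ${par._1.toMillis} ms")
    } yield ()
}
```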
This will generate a bunch of requests to the Iglu server at once. But I think that this is fine as each instance should be requesting only a few schemas at once, and even if we have tens of instances, it is very unlikely that all of them resolve the schemas at the exact same time.
That is a very good point. It could be many tens of schemas. I think we get the best separation of concerns if stuff like that is controlled by the connection pool in the HTTP client. Blaze by default allows 256 concurrent connections per server, and that is probably a bit too large for us.
I might make this change in combination with adding a default common-streams HTTP client where we override some config options like max connections per server.
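A hedged sketch of the kind of override described here, assuming http4s 0.23's Blaze client; the limits below are placeholders, not the values common-streams ends up using.

```scala
import cats.effect.{IO, Resource}
import org.http4s.client.Client
import org.http4s.blaze.client.BlazeClientBuilder

object RestrictedHttpClient {
  // Placeholder limits for illustration only.
  def resource: Resource[IO, Client[IO]] =
    BlazeClientBuilder[IO]
      .withMaxConnectionsPerRequestKey(_ => 4) // per-host cap (Blaze's default is 256)
      .withMaxTotalConnections(16)             // overall cap across all hosts
      .resource
}
```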
It could be many tens of schemas

If we have only one loader requesting that, that's fine (e.g. one collector instance can constantly handle more than 1000 requests per second). If we have tens of loaders requesting it, that's a little bit scarier, but it's still probably fine: it's unlikely to happen, and there is your new exitOnMissingSchema feature (which I think is a great idea).
I think we get the best separation of concerns if stuff like that is controlled by the connection pool in the HTTP client.

I very much agree! This code shouldn't worry about the under-the-hood HTTP client.
I might make this change in combination with adding a default common-streams HTTP client where we override some config options like max connections per server
👌
Opened #87 to address the HTTP client connection pool.
The following improvements are introduced via common-streams 0.8.0-M4:
- Fields starting with a digit are now prefixed with an underscore `_`. This is needed for Hudi, which does not allow fields starting with a digit (snowplow/schema-ddl#209)
- New kinesis source implementation without fs2-kinesis (snowplow-incubator/common-streams#84)
- Iglu schemas are resolved in parallel, for short pause times during event processing (snowplow-incubator/common-streams#85)
- Common http client configured with restricted max connections per server (snowplow-incubator/common-streams#87)
- Iglu scala client 3.2.0 no longer relies on the "list" schemas endpoint (snowplow/iglu-scala-client#255)