Resolve schemas in parallel #85

Merged 1 commit into develop on Sep 16, 2024
Conversation

istreeter (Contributor):
I have seen examples from our apps where CPU usage and event throughput periodically drop. This appears to coincide with the Iglu cache expiration time.

I believe this happens because all schemas tend to expire at the same time and need to be re-fetched by iglu-scala-client. Currently, we traverse over schemas sequentially, so we need to wait for each success before fetching the next schema. For a pipeline using many schemas, this can be a long period of downtime (several seconds) as we pause for schema resolution.

This commit switches to resolving schemas in parallel, so the downtime pauses should be shorter.

@@ -61,7 +62,7 @@ object NonAtomicFields {
        // Remove whole schema family if there is no subversion left after filtering
        subVersions.nonEmpty
      }
-     .traverse { case (tabledEntity, subVersions) =>
+     .parTraverse { case (tabledEntity, subVersions) =>
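The one-word change above is the whole fix: cats' `parTraverse` starts all the effects at once instead of sequencing them one after another. As a rough illustrative sketch of the same idea, using Scala's standard-library Futures rather than the cats-effect types in the real code (the `fetchSchema` helper, keys, and timings here are made up for illustration):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for fetching one schema from an Iglu server (~100 ms each)
def fetchSchema(key: String): Future[String] =
  Future { Thread.sleep(100); s"schema:$key" }

val keys = List("a", "b", "c", "d")

// Sequential, like `.traverse`: each fetch starts only after the previous one
// succeeds, so total latency is roughly the sum of all fetches (~400 ms here)
def sequential(): Future[List[String]] =
  keys.foldLeft(Future.successful(List.empty[String])) { (acc, k) =>
    acc.flatMap(done => fetchSchema(k).map(done :+ _))
  }

// Parallel, like `.parTraverse`: all fetches are started up front, so total
// latency is roughly that of the slowest single fetch (~100 ms here)
def parallel(): Future[List[String]] =
  Future.sequence(keys.map(fetchSchema)) // Futures are eager: all four run at once

val t0     = System.nanoTime()
val seqRes = Await.result(sequential(), 5.seconds)
val seqMs  = (System.nanoTime() - t0) / 1000000

val t1     = System.nanoTime()
val parRes = Await.result(parallel(), 5.seconds)
val parMs  = (System.nanoTime() - t1) / 1000000

// Same results either way; only the wall-clock time differs
println(s"sequential: ${seqMs} ms, parallel: ${parMs} ms")
```

With many schemas expiring from the cache at once, this is the difference between a pause proportional to the number of schemas and a pause proportional to one round trip (subject to the HTTP client's connection limits, discussed below).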
Contributor:
This will generate a bunch of requests to the Iglu server at once. But I think that this is fine as each instance should be requesting only a few schemas at once, and even if we have tens of instances, it is very unlikely that all of them resolve the schemas at the exact same time.

Contributor (Author):

That is a very good point. It could be many 10s of schemas. I think we get the best separation of concerns if stuff like that is controlled by the connection pool in the HTTP client. Blaze by default allows 256 concurrent connections per server, and that is probably a bit too large for us.

I might make this change in combination with adding a default common-streams HTTP client where we override some config options like max connections per server.

Contributor:

> It could be many 10s of schemas

If we have only one loader requesting that, that's fine (e.g. one collector instance can constantly handle more than 1000 requests/second). If we have tens of loaders requesting it, that's a little bit scarier, but still probably fine: it's unlikely to happen, and there is your new exitOnMissingSchema feature (which I think is a great idea).

> I think we get the best separation of concerns if stuff like that is controlled by the connection pool in the HTTP client.

I very much agree! This code shouldn't worry about the under-the-hood HTTP client.

> I might make this change in combination with adding a default common-streams HTTP client where we override some config options like max connections per server

👌

Contributor (Author):

Opened #87 to address the http client connection pool.

@istreeter istreeter merged commit e0f8592 into develop Sep 16, 2024
1 check passed
@istreeter istreeter deleted the resolve-schemas-in-parallel branch September 16, 2024 09:46
istreeter added a commit to snowplow-incubator/snowplow-lake-loader that referenced this pull request Sep 20, 2024
The following improvements are introduced via common-streams 0.8.0-M4:

- Fields starting with a digit are now prefixed with an underscore `_`.
  This is needed for Hudi, which does not allow fields starting with a
  digit (snowplow/schema-ddl#209)
- New kinesis source implementation without fs2-kinesis
  (snowplow-incubator/common-streams#84)
- Iglu schemas are resolved in parallel, for short pause times during
  event processing (snowplow-incubator/common-streams#85)
- Common http client configured with restricted max connections per
  server (snowplow-incubator/common-streams#87)
- Iglu scala client 3.2.0 no longer relies on the "list" schemas
  endpoint (snowplow/iglu-scala-client#255)
oguzhanunlu pushed a commit to snowplow-incubator/snowplow-lake-loader that referenced this pull request Nov 1, 2024