Resolve schemas in parallel #85
Conversation
I have seen examples from our apps where CPU usage and event throughput periodically drop. This appears to coincide with the Iglu cache expiration time.

I believe this happens because all schemas tend to expire at the same time and need to be re-fetched by iglu-scala-client. Currently, we traverse over schemas sequentially, so we need to wait for each success before fetching the next schema. For a pipeline using many schemas, this can be a long period of downtime (several seconds) as we pause for schema resolution.

This commit switches to resolving schemas in parallel, so the downtime pauses should be shorter.
@@ -61,7 +62,7 @@ object NonAtomicFields {
         // Remove whole schema family if there is no subversion left after filtering
         subVersions.nonEmpty
       }
-      .traverse { case (tabledEntity, subVersions) =>
+      .parTraverse { case (tabledEntity, subVersions) =>
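For context, a minimal self-contained sketch of the difference (illustrative only, not the loader code; `resolveSchema` is a made-up stub and the 500 ms latency is arbitrary): with `traverse` the lookups run one after another, while `parTraverse` runs them concurrently.

```scala
import cats.effect.{IO, IOApp}
import cats.syntax.all._
import scala.concurrent.duration._

object ParallelResolutionSketch extends IOApp.Simple {

  // Hypothetical stand-in for an Iglu schema lookup; the sleep models network latency.
  def resolveSchema(key: String): IO[String] =
    IO.sleep(500.millis).as(s"schema for $key")

  val keys = List("entity_a", "entity_b", "entity_c", "entity_d")

  val run: IO[Unit] =
    for {
      seq <- keys.traverse(resolveSchema).timed    // ~2s: each lookup waits for the previous one
      par <- keys.parTraverse(resolveSchema).timed // ~0.5s: lookups run concurrently
      _   <- IO.println(s"sequential: ${seq._1.toMillis} ms, parallel: ${par._1.toMillis} ms")
    } yield ()
}
```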
This will generate a bunch of requests to the Iglu server at once. But I think that this is fine as each instance should be requesting only a few schemas at once, and even if we have tens of instances, it is very unlikely that all of them resolve the schemas at the exact same time.
That is a very good point. It could be many tens of schemas. I think we get the best separation of concerns if stuff like that is controlled by the connection pool in the HTTP client. Blaze by default allows 256 concurrent connections per server, and that is probably a bit too large for us.
I might make this change in combination with adding a default common-streams HTTP client where we override some config options like max connections per server.
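A hedged sketch of the kind of override described here, assuming http4s 0.23's Blaze client; the limits below are placeholders, not the values common-streams ends up using.

```scala
import cats.effect.{IO, Resource}
import org.http4s.client.Client
import org.http4s.blaze.client.BlazeClientBuilder

object RestrictedHttpClient {
  // Placeholder limits for illustration only.
  def resource: Resource[IO, Client[IO]] =
    BlazeClientBuilder[IO]
      .withMaxConnectionsPerRequestKey(_ => 4) // per-host cap (Blaze's default is 256)
      .withMaxTotalConnections(16)             // overall cap across all hosts
      .resource
}
```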
It could be many tens of schemas

If we have only one loader requesting that, that's fine (e.g. one collector instance can constantly handle more than 1000 requests per second). If we have tens of loaders requesting it, that's a little bit scarier, but it's still probably fine: it's unlikely to happen, and there is your new exitOnMissingSchema feature (which I think is a great idea).
I think we get the best separation of concerns if stuff like that is controlled by the connection pool in the HTTP client.

I very much agree! This code shouldn't worry about the under-the-hood HTTP client.
I might make this change in combination with adding a default common-streams HTTP client where we override some config options like max connections per server
👌
Opened #87 to address the HTTP client connection pool.
The following improvements are introduced via common-streams 0.8.0-M4:
- Fields starting with a digit are now prefixed with an underscore `_`. This is needed for Hudi, which does not allow fields starting with a digit (snowplow/schema-ddl#209)
- New kinesis source implementation without fs2-kinesis (snowplow-incubator/common-streams#84)
- Iglu schemas are resolved in parallel, for short pause times during event processing (snowplow-incubator/common-streams#85)
- Common http client configured with restricted max connections per server (snowplow-incubator/common-streams#87)
- Iglu scala client 3.2.0 no longer relies on the "list" schemas endpoint (snowplow/iglu-scala-client#255)