Version 3.0.0
Assets
This release concerns 3 assets:
- enrich-kinesis: this is the new enrich asset for AWS, which aims to replace Stream Enrich.
- enrich-pubsub: this is now the only enrich asset maintained for GCP.
- Stream Enrich: this asset for AWS remains supported until the transition to enrich-kinesis is complete and a new asset, enrich-kafka, is ready. In this release it only received library bumps.
As announced previously in this post, Beam Enrich is now deprecated in favor of enrich-pubsub.
enrich-kinesis
This new enrich asset for Kinesis is based on fs2 and shares most of its codebase with enrich-pubsub.
Compared with Stream Enrich, this app brings several improvements:
- It can export metrics. More details can be found on this page.
- Assets used in the enrichments (e.g. MaxMind DB) can be periodically refreshed while enrich is running with this config parameter:
"assetsUpdatePeriod": "7 days"
- It uses Kinesis Client Library (KCL) 2.x.
- It provides the pipeline operator with more possibilities for fine-tuning.
- It is now possible to use Kinesis aggregation, which consists of packing several user records (e.g. enriched events) into one Kinesis record. This can improve throughput and/or reduce the number of shards needed (in particular if records are bigger than 1 kb). More information about aggregation can be found here. It can be activated with the following section in the configuration (e.g. for enriched events):
  "output": {
    "good": {
      "aggregation": {
        "maxCount": 1000
        "maxSize": 51200
      }
    }
  }
- It is possible to run the app with a very minimal configuration file, such as:
  {
    "input": {
      "streamName": "collector-payloads"
    }
    "output": {
      "good": {
        "streamName": "enriched"
      }
      "bad": {
        "streamName": "bad"
      }
    }
  }
Instructions to run enrich-kinesis can be found on this page and details about its configuration on this page.
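To make the two aggregation limits concrete, here is a minimal JavaScript sketch of the batching idea: user records are packed into a batch until either maxCount records or maxSize bytes would be exceeded. The function name aggregate and the string-based records are hypothetical. This is neither the actual Kinesis aggregated record format nor the code used by enrich-kinesis, only an illustration of how the two limits interact.

```javascript
// Hypothetical sketch: group user records into batches bounded by
// a maximum record count and a maximum total size in bytes.
function aggregate(records, maxCount, maxSize) {
  const batches = [];
  let current = [];
  let currentSize = 0;

  for (const record of records) {
    const size = Buffer.byteLength(record, "utf8");
    const full = current.length >= maxCount || currentSize + size > maxSize;
    // Close the current batch when adding the record would break a limit.
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      currentSize = 0;
    }
    // A record larger than maxSize still gets a batch of its own.
    current.push(record);
    currentSize += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Example with the limits from the aggregation section shown earlier.
const events = Array.from({ length: 2500 }, (_, i) => `event-${i}`);
const batches = aggregate(events, 1000, 51200);
console.log(batches.map((b) => b.length));
```

With records this small the count limit is reached first; with larger records the size limit would close batches earlier.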
enrich-pubsub
More parameters have been exposed in the config file to give finer-grained control over the app.
All the details about its configuration can be found on this page.
Javascript enrichment: ECMAScript 6 features (#508)
Users of the Javascript enrichment will be pleased to hear that, starting from this version, most ECMAScript 6 features are supported. For example, ES6 features like the arrow => syntax and the const keyword are now available. This change is fully backward-compatible and existing configs will keep working.
More details on Javascript enrichment can be found on this page.
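As an illustration, an enrichment script could now be written with const and an arrow function. This is a minimal sketch: it assumes the usual shape of an enrichment script (a process(event) function returning an array of self-describing contexts) and a getApp_id() getter on the event. The schema URI iglu:com.acme/app_info/jsonschema/1-0-0 and its appId field are hypothetical.

```javascript
// Sketch of a Javascript enrichment script using ES6 syntax.
// Assumption: the enrichment calls process(event) and attaches the
// returned self-describing JSONs to the event as contexts.
const SCHEMA = "iglu:com.acme/app_info/jsonschema/1-0-0"; // hypothetical schema

// Arrow function helper: fall back to "unknown" when app_id is missing.
const normalize = (appId) => String(appId || "unknown").trim().toLowerCase();

function process(event) {
  const appId = normalize(event.getApp_id());
  return [{ schema: SCHEMA, data: { appId: appId } }];
}
```

A script written in the pre-ES6 style (var and function expressions) keeps working unchanged.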
Enriched events validation in enrich-kinesis and enrich-pubsub (#517)
Enriched events emitted by enrich are expected to match the atomic schema. If an event is not valid against this schema (for instance because a field is too long), a bad row should be emitted instead of the enriched event. To further improve data quality inside the pipeline, enrich 3.0.0 introduces this additional check.
However, we are aware that this is a breaking change, and we want to give users some time to adapt in case they are currently working downstream with enriched events that are not valid against the atomic schema. For this reason, the new validation was added as a feature that can be deactivated as follows:
"featureFlags": {
  "acceptInvalid": true
}
In this case, enriched events that are not valid against the atomic schema will still be emitted as before, so that enrich 3.0.0 can be fully backward compatible. There are two ways to find out whether the new validation would have had an impact:
- A new metric, invalid_enriched, has been introduced. It reports the number of enriched events that were not valid against the atomic schema. Like the other metrics, it can be seen on stdout and/or StatsD.
- Each time an enriched event is invalid against the atomic schema, a line is logged with the bad row that would normally have been emitted instead of the enriched event (add -Dorg.slf4j.simpleLogger.log.InvalidEnriched=debug to the JAVA_OPTS to see it).
In a few months, we'll remove the feature flag and it will become impossible to emit invalid enriched events.
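The behaviour of the feature flag can be summarised with the following JavaScript sketch. It is only a model of the logic described in this section, not the enrich implementation: the names emitOrReject and isValidAgainstAtomic, the single length check on app_id and the shape of the bad row are all hypothetical.

```javascript
// Hypothetical model of the atomic-schema check and the acceptInvalid flag.
const metrics = { invalid_enriched: 0 };

// Stand-in for validation against the atomic schema: a single length
// limit on one field (an assumption; the real schema covers many fields).
const isValidAgainstAtomic = (event) => (event.app_id || "").length <= 255;

function emitOrReject(event, featureFlags) {
  if (isValidAgainstAtomic(event)) {
    return { output: "good", payload: event };
  }
  const badRow = { failure: "not valid against atomic schema", payload: event };
  if (featureFlags.acceptInvalid) {
    // Backward-compatible mode: count and log, but still emit the event.
    // Assumption: the metric is modelled for this mode only.
    metrics.invalid_enriched += 1;
    console.debug("InvalidEnriched:", JSON.stringify(badRow));
    return { output: "good", payload: event };
  }
  // Default in 3.0.0: a bad row replaces the enriched event.
  return { output: "bad", payload: badRow };
}
```

Running a pipeline with acceptInvalid set to true and watching the counter is a way to estimate how many events would turn into bad rows once the flag is removed.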
Metrics for enrich-kinesis and enrich-pubsub (#494)
There were 2 issues with the metrics periodically sent by enrich-pubsub:
- The counts of collector payloads, enriched events and bad rows were ever-increasing and not reset to 0 after sending the metrics.
- These counts were sent to StatsD with this format: snowplow.enrich.good:1234|g|#key1:value1, where g means gauge, whereas it should be c for counter.
This has been fixed. On top of that, it is now possible to see the metrics directly in the logs of the app, with this section in the config file:
"monitoring": {
  "metrics": {
    "stdout": {
      "period": "1 minute"
      "prefix": "snowplow.enrich."
    }
  }
}
Because enrich-pubsub and enrich-kinesis share most of the code, all of the above is also true for the latter.
More information about metrics can be found on this page.
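The two fixes can be sketched in JavaScript as follows. This is an illustrative model rather than the code of enrich: the Counter class and the formatStatsd function are hypothetical names, and the tag syntax simply mirrors the example line given in this section.

```javascript
// Hypothetical sketch: a counter that is reset after each report,
// formatted as a StatsD counter ("c") rather than a gauge ("g").
class Counter {
  constructor() {
    this.value = 0;
  }
  increment(n = 1) {
    this.value += n;
  }
  // Return the count accumulated since the last report, then reset to 0.
  flush() {
    const count = this.value;
    this.value = 0;
    return count;
  }
}

const formatStatsd = (prefix, name, count, tags = {}) => {
  const tagStr = Object.entries(tags)
    .map(([k, v]) => `${k}:${v}`)
    .join(",");
  // Tags are appended only when at least one is present.
  return `${prefix}${name}:${count}|c` + (tagStr ? `|#${tagStr}` : "");
};

const good = new Counter();
good.increment(1234);
console.log(formatStatsd("snowplow.enrich.", "good", good.flush(), { key1: "value1" }));
// prints: snowplow.enrich.good:1234|c|#key1:value1
```

Because flush() resets the value, each reporting period only carries the events seen during that period, which is what a StatsD counter expects.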
YAUAA context 1-0-3 (#515)
The context attached by YAUAA enrichment has been updated to 1-0-3.
Compared to 1-0-2, this version allows a longer agentVersionMajor string field, which addresses a problem in which some user agents caused the old maximum length to be exceeded, resulting in a failed event.
Telemetry in enrich-kinesis and enrich-pubsub (#487)
enrich-kinesis and enrich-pubsub introduce telemetry, which consists of regularly sending heartbeats with some meta-information about the application (schema here). This helps us improve the product: we need to understand what is popular so that we can focus our development effort in the right place.
At a minimum, telemetry sends the application name and version every hour. It would be helpful for us if users could provide userProvidedId in the config file:
"telemetry": {
  "userProvidedId": "myCompany"
}
Telemetry can be deactivated by putting the following section in the configuration file:
"telemetry": {
  "disable": true
}
Changelog
- enrich-pubsub: split into common module and PubSub module (#473)
- enrich-pubsub: Bump fs2-google-pubsub to 0.18.1 (#513)
- enrich-kinesis: create enrich asset based on fs2 (#480)
- common-fs2: Metrics: send counts instead of gauges (#494)
- common-fs2: File sink should rotate files with maximum size (#440)
- common-fs2: put good bad and pii inside output {} in config (#493)
- Bump circe to 0.14.1 (#496)
- Set spray-json transitive dependency to 1.3.6 (#498)
- Bump jackson-databind to 2.11.4 (#499)
- Bump snowplow-badrows to 2.1.1 (#500)
- Remove tomcat-embed-core transitive dependency (#501)
- Set netty transitive dependency to 4.1.68.Final (#502)
- Add possibility to use STS to authenticate (#318)
- Bump Snowplow Scala tracker to 1.0.0 (#504)
- common-fs2: add telemetry (#489)
- Bump Iglu client to 1.1.1 (#507)
- enrich-pubsub: add reference.conf and provide minimal config example (#505)
- Add Github Action to scan Docker images with lacework (#506)
- common: use schemas/nl.basjes/yauaa_context/jsonschema/1-0-3 (#515)
- Enable ES6 by default in javascript enrichment (#508)
- Publish arm64 and amd64 docker images (#491)
- Beam Enrich: deprecate (#530)
- common: SQL enrichment: fix getConnection for Sync (#546)
- common: catch and handle errors in the CurrencyConversionEnrichment (#542)
- Validate enriched event against atomic schema before emitting (#517)
- Stream Enrich Kafka: Enable AWS MSK IAM Authentication (#547)
- Stream Enrich Kafka: bump Kafka Client to 2.8.1 (#518)
- common: add Adapted type (#560)
- enrich-kinesis: add integration test (#531)
- enrich-pubsub: bump GCP SDK to 2.4.2 (#562)
- Set protobuf-java transitive dependency to 3.19.4 (#561)
- enrich-pubsub: set gson transitive dependency to 2.9.0 (#565)
- enrich-pubsub: set google-oauth-client transitive dependency to 1.33.1 (#566)