Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate enriched event against atomic schema before emitting #517

Closed
istreeter opened this issue Nov 22, 2021 · 6 comments
Closed

Validate enriched event against atomic schema before emitting #517

istreeter opened this issue Nov 22, 2021 · 6 comments
Milestone

Comments

@istreeter
Copy link
Contributor

Events emitted by enrich should conform to the atomic event schema. However, currently Enrich can emit events whose fields exceed the maximum allowed string length. There should be an enrichment that converts events to failed events if they violate the protocol.

Outstanding questions:
Should the bad row be a Schema Violation or an Enrichment Failure?
Should it be configurably on/off or enforced on?

This change will remove the need for loaders to have special handling of long fields, e.g. here

@chuwy
Copy link
Contributor

chuwy commented Nov 22, 2021

Should it be configurably on/off or enforced on?

I believe it should be configurable and be on by default. Fingerprint enrichment kind of enrichment.

Should the bad row be a Schema Violation or an Enrichment Failure?

I have very light preference towards Enrichment Failure. Enrichment is a more downstream stage and technically should have more details about the payload, so easier to inspect and recover. The enrichment could be called something like event_validation. I think it also could be combined with post-enrichment validation (snowplow/snowplow#3795)

@chuwy
Copy link
Contributor

chuwy commented Nov 22, 2021

This change will remove the need for loaders to have special handling of long fields

I think it would just remove the trimming, but we still would have to validate them and also generate a bad row

@istreeter
Copy link
Contributor Author

we still would have to validate them and also generate a bad row

Maybe we could make the event parser more strict in the analytics sdk, and then we get the bad row automatically in all loaders.

@benjben
Copy link
Contributor

benjben commented Nov 24, 2021

As I see it this should not be an enrichment, but a required step of the enrich job, as enrich should not emit events that are not valid according to atomic schema. At the moment we only check that the SDJs are valid, but we should check everything.

Ultimately the plan is to use case class Event from analytics SDK instead of the mutable EnrichedEvent that we have at the moment. When we have that, we can just call event.toSDJ.isValid that validates SelfDescribingData[Json].

For the bad row type, I think that the check should be run twice:

  1. Once when we parse the input event. If not valid the bad row type is schema violation.
  2. Once after we have run the enrichments. If not valid the bad row type is enrichment failure.

@voropaevp voropaevp changed the title Enrichment to enforce maximum field lengths Enforce maximum field lengths Dec 15, 2021
@voropaevp
Copy link
Contributor

Oversized events are not allowed at all. So this is no longer an enrichment but part of enrich itself.

@benjben benjben mentioned this issue Dec 15, 2021
15 tasks
@benjben benjben changed the title Enforce maximum field lengths Validate enriched event against atomic schema before emitting Jan 18, 2022
@benjben benjben added this to the 3.0.0 milestone Jan 26, 2022
@benjben
Copy link
Contributor

benjben commented Feb 9, 2022

If an event is not valid against atomic schema, a bad row should be emitted instead of the enriched event.
However, this is a breaking change, and we want to give some time to users to adapt, in case today they are working with enriched events that are not valid against atomic. For this reason, this new validation will be added as a feature that can be deactivated through a feature flag (acceptInvalid) in the config.

If acceptInvalid is set to false, a bad row will be emitted instead of the enriched event in case it's not valid against atomic schema.

In case acceptInvalid is set to true, then enriched events that are not valid against atomic schema will still be emitted as before, so that enrich 3.0.0 can be fully backward compatible. It will be possible to know if the new validation would have add an impact by 2 ways:

  1. Each time there will be an enriched event invalid against atomic schema, a line would be logged with the bad row (log level DEBUG logger name InvalidEnriched).
  2. We introduced a new metric invalid_enriched that reports the number of enriched events that were not valid against atomic schema. As the other metrics, it can be seen on stdout and/or StatsD.

When we'll know that all our customers don't have any invalid enriched events any more, we'll remove the feature flags and it will be impossible to emit invalid enriched events.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants