Dependency on analytics SDK? #53

The idea of a dependency on the analytics SDK seems odd to me. I think of bad rows as a core part of the pipeline, whereas the analytics SDK is a downstream consumer of the core pipeline's outputs. What am I missing?

Comments
I think the only reason we depend on the analytics SDK here is that the SDK is where we define a class for the canonical Snowplow event. This is quite a chunky class, and for maintainability it makes sense to define it in only one place. I believe it was first introduced in the SDK because the need for it there is quite obvious, and then we kept reusing it elsewhere. Maybe it would make sense to have it as a mini-library that can be depended on instead of the full SDK.
Some bad rows (e.g. loader bad rows) contain the enriched event (e.g. here and soon here). This enriched event is defined in the analytics SDK here. We plan to emit this case class in enrich as well, so that there is a single source of truth for the definition of an enriched event. The analytics SDK can be seen as the definition of an enriched event plus JSON/TSV encoders/decoders, and we need the JSON ones in the bad rows library.
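For illustration, here is a minimal sketch of what "definition of an enriched event plus codecs" means in practice. The class, field set, and `parseTsv` helper below are placeholders, not the analytics SDK's actual API; the real event class has well over a hundred fields and a far more thorough parser.

```scala
import java.time.Instant
import java.util.UUID

import scala.util.Try

import io.circe.Encoder
import io.circe.generic.semiauto.deriveEncoder

// A trimmed-down stand-in for the canonical enriched event; the real class
// has many more fields, only four are shown here.
final case class Event(
  eventId: UUID,
  collectorTstamp: Instant,
  appId: Option[String],
  platform: Option[String]
)

object Event {
  // JSON half of the codecs: this is the part the bad rows library needs in
  // order to embed the enriched event inside a bad row payload.
  implicit val eventEncoder: Encoder[Event] = deriveEncoder[Event]

  // TSV half: parse one tab-separated enriched-event line (happy path only;
  // the real parser validates every field and accumulates errors).
  def parseTsv(line: String): Either[String, Event] =
    line.split("\t", -1) match {
      case Array(id, tstamp, app, platform, _*) =>
        Try(
          Event(
            UUID.fromString(id),
            Instant.parse(tstamp),
            Option(app).filter(_.nonEmpty),
            Option(platform).filter(_.nonEmpty)
          )
        ).toEither.left.map(_.getMessage)
      case fields =>
        Left(s"expected at least 4 fields, got ${fields.length}")
    }
}
```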
Hehe, Dilyan was quicker.
I agree with @dilyand: the enriched event should be its own mini-library, and that should be pulled into both bad-rows and the analytics-sdk. The analytics SDK should be a downstream consumer of entities produced upstream. To put it another way: the fact that the enriched event was first defined in the analytics SDK, and not in an upstream library, is a bug, not a feature.
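A sketch of how the build could look under this proposal. The module names (`snowplow-event` for the mini-library), the directory layout, and the circe version are assumptions for illustration, not the published artifacts:

```scala
// build.sbt sketch; module names and versions are hypothetical.

// The mini-library: only the canonical event definition and its codecs.
lazy val snowplowEvent = (project in file("modules/event"))
  .settings(
    name := "snowplow-event",
    libraryDependencies += "io.circe" %% "circe-generic" % "0.13.0"
  )

// The bad rows library embeds the enriched event, so it depends on the
// mini-library rather than on the analytics SDK.
lazy val badRows = (project in file("modules/bad-rows"))
  .settings(name := "snowplow-badrows")
  .dependsOn(snowplowEvent)

// The analytics SDK becomes a downstream consumer of the same mini-library.
// Since neither badRows nor analyticsSdk depends on the other, later adding
// bad row support to the SDK would not create a cycle.
lazy val analyticsSdk = (project in file("modules/analytics-sdk"))
  .settings(name := "snowplow-scala-analytics-sdk")
  .dependsOn(snowplowEvent)
```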
Thinking about it more, in light of what Ben said, the analytics SDK is indeed just that mini-library: it has the definition of the canonical event, of the unstruct event and other bits and pieces to do with them, and then just the codecs to derive those classes from data and to produce data from class instances. In fact, there's no analytics at all in the SDK: no aggregation, no statistics, no pivoting. Maybe we named it that way because we wanted to add those features in the future? But right now it just looks like a misnomer.
The analytics SDKs are small libraries designed for downstream consumers of the Snowplow technology: https://docs.snowplowanalytics.com/docs/modeling-your-data/analytics-sdk/

It makes sense for the Scala analytics SDK to be a very lightweight shim over core libraries, written in Scala, that the upstream pipeline uses. To me it doesn't make sense at all for an 'analytics SDK' to be embedded in the core of Snowplow, any more than I'd expect a tracker to be embedded in the core.

My proposal is to move the core event technology up into an upstream library and out of this analytics SDK, then add it back in as a dependency. If we don't do this, we will end up with a weird circular reference when we implement snowplow/snowplow-scala-analytics-sdk#114.
Absolutely. I was just pointing out that once we move the core event tech out, there would be nothing left in this SDK, unless it's the new functionality that would make it easy to work with bad rows. But then maybe that functionality should be part of the bad rows library, and we don't need this SDK at all.
This is what I would do:

The things I would keep in the Scala Analytics SDK are the lossy/opinionated formats for shredding/
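The comment trails off here, but as a hedged sketch of the kind of lossy, opinionated shredding transform it seems to describe: flattening a self-describing schema reference into a single warehouse-friendly column name. The naming convention below is modelled on how Snowplow loaders name shredded columns; the `SchemaKey` shape and the helper itself are hypothetical, not the SDK's actual code.

```scala
final case class SchemaKey(vendor: String, name: String, model: Int)

// Lossy because it keeps only the schema's MODEL version, dropping the
// REVISION and ADDITION components; opinionated because the snake_case
// column-name convention is baked in.
def shreddedColumnName(prefix: String, key: SchemaKey): String = {
  def snakeCase(s: String): String =
    s.replaceAll("([a-z0-9])([A-Z])", "$1_$2")
      .replace('.', '_')
      .replace('-', '_')
      .toLowerCase
  s"${prefix}_${snakeCase(key.vendor)}_${snakeCase(key.name)}_${key.model}"
}

// shreddedColumnName("unstruct_event", SchemaKey("com.acme", "linkClick", 1))
//   == "unstruct_event_com_acme_link_click_1"
```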