Dependency on analytics SDK? #53

The idea of a dependency on the analytics SDK seems odd to me. I think of bad rows as a core part of the pipeline, whereas the analytics SDK is a downstream consumer of the core pipeline's outputs. What am I missing?

Comments
I think the only reason we depend on the analytics SDK here is that the SDK is where we define a class for the canonical Snowplow event. This is quite a chunky class, and for maintainability it makes sense to define it in only one place. I believe it was first introduced in the SDK because the need for it there is quite obvious, and then we kept reusing it elsewhere. Maybe it would make sense to have it as a mini-library that can be depended on instead of the full SDK.
Some bad rows (e.g. loader bad rows) contain the enriched event (e.g. here and soon here). This enriched event is defined in the analytics SDK here. We plan to emit this case class in enrich as well, so that there is a single source of truth for the definition of an enriched event. The analytics SDK can be seen as the definition of an enriched event plus JSON/TSV encoders/decoders, and we need the JSON ones in the bad rows library.
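For illustration, here is a minimal sketch of what "definition of an enriched event plus codecs" means in practice. The class, field set, and `parseTsv` helper below are placeholders, not the analytics SDK's actual API; the real event class has well over a hundred fields and a far more thorough parser.

```scala
import java.time.Instant
import java.util.UUID

import scala.util.Try

import io.circe.Encoder
import io.circe.generic.semiauto.deriveEncoder

// A trimmed-down stand-in for the canonical enriched event; the real class
// has many more fields, only four are shown here.
final case class Event(
  eventId: UUID,
  collectorTstamp: Instant,
  appId: Option[String],
  platform: Option[String]
)

object Event {
  // JSON half of the codecs: this is the part the bad rows library needs in
  // order to embed the enriched event inside a bad row payload.
  implicit val eventEncoder: Encoder[Event] = deriveEncoder[Event]

  // TSV half: parse one tab-separated enriched-event line (happy path only;
  // the real parser validates every field and accumulates errors).
  def parseTsv(line: String): Either[String, Event] =
    line.split("\t", -1) match {
      case Array(id, tstamp, app, platform, _*) =>
        Try(
          Event(
            UUID.fromString(id),
            Instant.parse(tstamp),
            Option(app).filter(_.nonEmpty),
            Option(platform).filter(_.nonEmpty)
          )
        ).toEither.left.map(_.getMessage)
      case fields =>
        Left(s"expected at least 4 fields, got ${fields.length}")
    }
}
```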
Hehe, Dilyan was quicker.
I agree with @dilyand: the enriched event should be its own mini-library, and that should be pulled into both bad-rows and the analytics-sdk. The analytics SDK should be a downstream consumer of entities produced upstream. To put it another way: the fact that the enriched event was first defined in the analytics SDK, and not in an upstream library, is a bug, not a feature.
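A sketch of how the build could look under this proposal. The module names (`snowplow-event` for the mini-library), the directory layout, and the circe version are assumptions for illustration, not the published artifacts:

```scala
// build.sbt sketch; module names and versions are hypothetical.

// The mini-library: only the canonical event definition and its codecs.
lazy val snowplowEvent = (project in file("modules/event"))
  .settings(
    name := "snowplow-event",
    libraryDependencies += "io.circe" %% "circe-generic" % "0.13.0"
  )

// The bad rows library embeds the enriched event, so it depends on the
// mini-library rather than on the analytics SDK.
lazy val badRows = (project in file("modules/bad-rows"))
  .settings(name := "snowplow-badrows")
  .dependsOn(snowplowEvent)

// The analytics SDK becomes a downstream consumer of the same mini-library.
// Since neither badRows nor analyticsSdk depends on the other, later adding
// bad row support to the SDK would not create a cycle.
lazy val analyticsSdk = (project in file("modules/analytics-sdk"))
  .settings(name := "snowplow-scala-analytics-sdk")
  .dependsOn(snowplowEvent)
```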
Thinking about it more, in light of what Ben said, the analytics SDK is indeed just that mini-library: it has the definition of the canonical event, of the unstruct event and other bits and pieces to do with them, and then just the codecs to derive those classes from data and to produce data from class instances. In fact, there's no analytics at all in the SDK: no aggregation, no statistics, no pivoting. Maybe we named it that way because we wanted to add those features in the future? But right now it just looks like a misnomer.
The analytics SDKs are small libraries designed for downstream consumers of the Snowplow technology: https://docs.snowplowanalytics.com/docs/modeling-your-data/analytics-sdk/

It makes sense for the Scala analytics SDK to be a very lightweight shim over core libraries, written in Scala, that the upstream pipeline uses. To me it doesn't make sense at all for an 'analytics SDK' to be embedded in the core of Snowplow, any more than I'd expect a tracker to be embedded in the core.

My proposal is to move the core event technology up into an upstream library and out of this analytics SDK, then add it back in as a dependency. If we don't do this, we will end up with a weird circular reference when we implement snowplow/snowplow-scala-analytics-sdk#114.
Absolutely. I was just pointing out that once we move the core event tech out, there would be nothing left in this SDK, unless it's the new functionality that would make it easy to work with bad rows. But then maybe that functionality should be part of the bad rows library, and we don't need this SDK at all.
This is what I would do:

The things I would keep in the Scala Analytics SDK are the lossy/opinionated formats for shredding/
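The comment trails off here, but as a hedged sketch of the kind of lossy, opinionated shredding transform it seems to describe: flattening a self-describing schema reference into a single warehouse-friendly column name. The naming convention below is modelled on how Snowplow loaders name shredded columns; the `SchemaKey` shape and the helper itself are hypothetical, not the SDK's actual code.

```scala
final case class SchemaKey(vendor: String, name: String, model: Int)

// Lossy because it keeps only the schema's MODEL version, dropping the
// REVISION and ADDITION components; opinionated because the snake_case
// column-name convention is baked in.
def shreddedColumnName(prefix: String, key: SchemaKey): String = {
  def snakeCase(s: String): String =
    s.replaceAll("([a-z0-9])([A-Z])", "$1_$2")
      .replace('.', '_')
      .replace('-', '_')
      .toLowerCase
  s"${prefix}_${snakeCase(key.vendor)}_${snakeCase(key.name)}_${key.model}"
}

// shreddedColumnName("unstruct_event", SchemaKey("com.acme", "linkClick", 1))
//   == "unstruct_event_com_acme_link_click_1"
```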