Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dependency on analytics SDK? #53

Open
alexanderdean opened this issue Mar 5, 2021 · 8 comments
Open

Dependency on analytics SDK? #53

alexanderdean opened this issue Mar 5, 2021 · 8 comments

Comments

@alexanderdean
Copy link
Member

The idea of a dependency on the analytics SDK seems odd to me.

I think of bad rows as a core part of the pipeline, versus the analytics SDK as a downstream consumer of the upstream core pipeline outputs.

What am I missing?

@dilyand
Copy link

dilyand commented Mar 5, 2021

I think the only reason why we depend on the analytics sdk here is that the sdk is where we define a class for the canonical Snowplow event.

This is quite a chunky class and it makes sense that we should only have it defined in one place for maintainability. I believe it was first introduced in the sdk because the need for it there is quite obvious; and then we kept reusing it elsewhere.

Maybe it would make sense to have it as mini-library that can be depended on instead of the full sdk.

@benjben
Copy link
Contributor

benjben commented Mar 5, 2021

Some bad rows (e.g. loader bad rows) contain the enriched event (e.g. here and soon here).

This enriched event is defined in the analytics SDK here.

We plan to emit this case class also in enrich, so that there is a single source of truth to define an enriched event.

Analytics SDK can be seen as the definition of an enriched events + JSON/TSV encoders/decoders, and we need JSON ones in bad rows library.

@benjben
Copy link
Contributor

benjben commented Mar 5, 2021

Hehe Dilyan was quicker

@alexanderdean
Copy link
Member Author

I agree with @dilyand - the enriched event should be its own mini-library, and that should be pulled into bad-rows and the analytics-sdk.

The analytics sdk should be a downstream consumer of entities produced by upstream.

To put it another way: the fact that the enriched event was first defined in the analytics SDK, not in an upstream library, is a bug not a feature.

@dilyand
Copy link

dilyand commented Mar 5, 2021

Thinking about it more, in light of what Ben said, the analytics sdk is indeed just that mini library: it has the definition of the canonical event, of unstruct event and other bits and pieces to do with them, and then just codecs to derive those classes from data and to produce data from the class instances.

In fact, there's no analytics at all in the sdk -- no aggregation, no statistics, no pivoting... Maybe we named it that way because we wanted to add those features in the future? But right now it just looks like a misnomer.

@alexanderdean
Copy link
Member Author

alexanderdean commented Mar 5, 2021

The analytics SDKs are small libraries designed for downstream consumers of the Snowplow technology:

https://docs.snowplowanalytics.com/docs/modeling-your-data/analytics-sdk/

It makes sense for the Scala analytics SDK to be a very lightweight shim over core libraries written in Scala that the upstream pipeline uses.

To me it doesn't make sense at all for an 'analytics SDK' to be embedded in the core of Snowplow, any more than I'd expect a tracker to be embedded in the core.

My proposal is to move the core event technology up into an upstream library and out of this analytics SDK, then add it in as a dependency.

If we don't do this, we will end up with a weird circular reference when we implement snowplow/snowplow-scala-analytics-sdk#114

@dilyand
Copy link

dilyand commented Mar 5, 2021

move the core event technology up into an upstream library and out of this analytics SDK

Absolutely. I was just pointing out that when we move the core event tech out, there would be nothing left in this SDK. Unless it be the new functionality that would make it easy to work with bad rows. But then maybe that functionality should be part of the bad rows library and we don't need this SDK at all.

@alexanderdean
Copy link
Member Author

This is what I would do:

  • Lift most of the Scala Analytics SDK into new library snowplow-enriched-event
  • Use new library snowplow-enriched-event in our enrich engine (same as we reference badrows there)

The thing I would keep in the Scala Analytics SDK are the lossy/opionated formats for shredding/toJsoning our event format. At the moment we only have one flavour of that transformation code (e.g. in this file) but we will likely add more as we add more downstream targets (e.g. loaders, relays), and enrich does not need to know about any of those formats (because they are consumer-specific), so they don't need to live in snowplow-enriched-event

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants