Spike: Explore analytics options for tracking user journeys in DataHub #396

Closed
seanprivett opened this issue Jun 4, 2024 · 9 comments

seanprivett commented Jun 4, 2024

Do any DataHub logs provide us with useful user analytics?

What do Acryl have to say about analytics?

We would like to track user behaviour in DataHub. What are our options?

https://datahubproject.io/docs/datahub-web-react/src/app/analytics/

@seanprivett seanprivett converted this from a draft issue Jun 4, 2024
@YvanMOJdigital YvanMOJdigital changed the title Explore analytics options for tracking user journeys in DataHub Spike: Explore analytics options for tracking user journeys in DataHub Jun 5, 2024
@YvanMOJdigital

Readout from my Acryl intro:

Analytics: under development. They are fulfilling some ad hoc customer requests but are still working on it. No GA support.
Support: success engineer who is empowered to make small changes for users, weekly meetings available.
Feature requests: they are looking for active involvement from users on roadmap and strategy. Influence on custom connectors, open to contributions.
Metadata/data quality - can we pay just for that? No
Costing is based on number of read/write users and number of datasets. They prefer to start with a small set of use cases and grow from there.

They have offered a demo in a couple of weeks to show the UI if we are interested.

@MatMoore MatMoore self-assigned this Jun 12, 2024
@MatMoore MatMoore moved this from Todo to In Progress in Data Catalogue Jun 12, 2024
MatMoore commented Jun 12, 2024

Assumptions:

  • we would like to track end-to-end journeys, not just part of the journey
  • if possible we would like to view the analytics for both sites in one place (Google Analytics)

Some possible options...

Develop the GA support & no-code configuration support ourselves

Proxy datahub requests through something that alters the HTML

  • modify outgoing responses to inject a GA tag
  • this would allow us to track page views but no other events (it would bypass Datahub's framework) 👎
  • complicates the support model, as our deployment would diverge from the Datahub provided helm charts 👎
  • we have to maintain extra code 👎
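
As a rough illustration of the proxy option above, here is a minimal sketch of the rewriting step only. It assumes the proxy already has the HTML response body as a string. `injectGaTag` and the measurement ID format check are hypothetical and not part of Datahub; the injected markup follows the standard gtag.js snippet.

```javascript
// Hypothetical helper for the proxy option: inserts the standard gtag.js
// snippet before the closing </head> tag of an HTML response body.
function injectGaTag(html, measurementId) {
  // Only accept IDs shaped like GA4 measurement IDs (an assumed format check),
  // so nothing unexpected is interpolated into the page.
  if (!/^G-[A-Z0-9]+$/.test(measurementId)) {
    throw new Error(`Unexpected measurement ID: ${measurementId}`);
  }
  const snippet =
    `<script async src="https://www.googletagmanager.com/gtag/js?id=${measurementId}"></script>` +
    `<script>window.dataLayer=window.dataLayer||[];` +
    `function gtag(){dataLayer.push(arguments);}` +
    `gtag('js',new Date());gtag('config','${measurementId}');</script>`;
  // Replace only the first </head>; a body without one is returned unchanged.
  // A replacer function is used so "$" in the snippet is never treated specially.
  return html.replace(/<\/head>/i, (closingTag) => snippet + closingTag);
}
```

This only covers page views, which is the limitation noted in the list above: events raised inside the Datahub frontend would not pass through it.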

Use GA Data import

MatMoore commented Jun 12, 2024

Datahub depends on Google Analytics via this analytics library

    "@analytics/google-analytics": "^0.5.2",

Latest version is 1.0.7, and 1.0.0 is when they switched from GA3 to GA4

Universal Analytics stopped collecting data a year ago and is being turned off in a month, so no need to support it still.

The datahub plugin just wraps the track method https://github.com/DavidWells/analytics/blob/3eeba102f6db3efc89309c1b347b4e4cc2e1ccc1/packages/analytics-plugin-google-analytics/src/browser.js#L193

and then exposes page, event (track) and identify methods.

Datahub is also on an old version of the analytics package ("analytics": "^0.8.9"); the latest is 0.8.13. There don't seem to be any breaking changes between them, so it should be safe to update. https://github.com/DavidWells/analytics/blob/master/packages/analytics/CHANGELOG.md

MatMoore commented Jun 17, 2024

Patch here: datahub-project/datahub@master...MatMoore:datahub:update-analytics

I was able to verify that data comes through to the real time monitoring page when I override the initialisation to set cookie_domain to none. (Using localhost didn't seem to work)

    const googleAnalyticsPlugin = googleAnalytics({
        measurementIds,
        cookie_domain: 'none',
    });
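
If the override should only apply when testing locally, one option is to derive the value from the hostname. The sketch below is an assumption, not something in the patch: `gaConfigFor` and the list of local hostnames are hypothetical, and it mirrors the option shape used in the snippet above. `'auto'` is gtag's documented default for `cookie_domain`.

```javascript
// Hypothetical helper: builds the options object passed to googleAnalytics(),
// using cookie_domain 'none' only for local development hosts.
function gaConfigFor(hostname, measurementIds) {
  // Assumed set of hostnames that count as "local".
  const isLocal = hostname === 'localhost' || hostname === '127.0.0.1';
  return {
    measurementIds,
    // 'none' is the value that made local testing work in the comment above;
    // 'auto' is gtag's default and is kept for every other host.
    cookie_domain: isLocal ? 'none' : 'auto',
  };
}
```

It could then be used as `googleAnalytics(gaConfigFor(window.location.hostname, measurementIds))`.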

Actually there is an existing PR for this, we just need it to be merged: datahub-project/datahub#8231

MatMoore commented Jun 17, 2024

Here are some steps we would need to follow if zero-code analytics configuration is not released. This is actually not a huge amount of work to set up, but would add some overhead to our release process.

Separately, I've reached out on the datahub slack to ask about contributing to the zero-code analytics piece, but I think we can proceed with the setup below, and then remove it if/when zero-code analytics is deployed. This means the only blocker for us would be this upgrade: datahub-project/datahub#8231

Steps to deploy a customised frontend build

1. Check out the datahub release & apply our patch to the configuration file

OR pull from a forked repo.

I prefer the patch approach, though, as it's a very small change and it lets us manage the build from our existing data-catalogue repo.

2. Trigger frontend docker builds via github actions

Here is a full workflow that builds the image on demand and publishes to the github package repo. I successfully ran this on a fork of datahub.

name: build customised frontend
on:
  workflow_dispatch:

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: datahub-frontend

jobs:
  frontend_build:
    name: Build and Push DataHub Frontend Docker Image
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      attestations: write
      id-token: write
    steps:
      - name: Set up JDK 17
        uses: actions/setup-java@v3
        with:
          distribution: "zulu"
          java-version: 17

      - uses: gradle/gradle-build-action@v2

      - name: Check out the repo
        uses: acryldata/sane-checkout-action@v3

      - name: Log in to the Container registry
        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Pre-build artifacts for docker image
        run: |
          ./gradlew :datahub-frontend:dist -x test -x yarnTest -x yarnLint --parallel
          mv ./datahub-frontend/build/distributions/datahub-frontend-*.zip datahub-frontend.zip
        env:
          NODE_OPTIONS: "--max-old-space-size=3072"

      - name: Docker meta
        id: docker_meta
        uses: crazy-max/ghaction-docker-meta@v1
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME}}

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and Push Multi-Platform image
        id: push
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./docker/datahub-frontend/Dockerfile
          platforms: linux/amd64,linux/arm64/v8
          push: true
          tags: ${{ steps.docker_meta.outputs.tags }}
          labels: ${{ steps.docker_meta.outputs.labels }}

      - name: Generate artifact attestation
        uses: actions/attest-build-provenance@v1
        with:
          subject-name: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME}}
          subject-digest: ${{ steps.push.outputs.digest }}
          push-to-registry: true

3. Configure the helm chart

  • Swap out the image url for frontend
  • Verify the kubernetes cluster can pull from the new image repo

@MatMoore MatMoore moved this from In Progress to Review in Data Catalogue Jun 18, 2024
@MatMoore MatMoore moved this from Review to Done in Data Catalogue Jun 18, 2024
@MatMoore MatMoore closed this as completed by moving to Done in Data Catalogue Jun 18, 2024