# snowplow-badrows-tables

Resources to enable SQL queries over Snowplow bad rows

## Overview

A "bad row" is any payload that hits a Snowplow pipeline, but which cannot be successfully parsed or processed. Bad rows are preserved by writing them to cloud storage (e.g. S3 or GCS) as newline-delimited JSON objects.

Athena on AWS and BigQuery on GCP are tools that let you query your bad rows, using the cloud storage files as a back-end data source. For example:

```sql
SELECT data.failure.messages FROM adapter_failures
WHERE data.failure.timestamp > timestamp '2020-04-01'
```

This is great for debugging your pipeline without the need to load your bad rows into a separate database.

## Setup

To create tables in Athena or BigQuery, you need to provide table definitions corresponding to the JSON schemas of your bad row files. Each bad row type (e.g. schema violations, adapter failures) has its own JSON schema, and therefore requires its own table.
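
For illustration only, here is a minimal sketch of the shape an Athena table definition takes. The S3 path and the cut-down column list are placeholders, not the repo's actual definitions, which mirror the full JSON schema of each bad row type:

```sql
-- Illustrative sketch only: a cut-down Athena table over adapter_failures
-- bad rows stored as newline-delimited JSON in S3. The struct below is a
-- simplified placeholder; the real table definitions in this repo mirror
-- the full JSON schema of each bad row type (failure timestamp, payload
-- details, processor metadata, and so on).
CREATE EXTERNAL TABLE IF NOT EXISTS adapter_failures (
  data struct<
    failure: struct<messages: array<string>>,
    payload: string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- Placeholder bucket path: point this at wherever your pipeline writes bad rows
LOCATION 's3://your-badrows-bucket/adapter_failures/';
```

Once the full table definitions from this repo are in place, queries like the one in the Overview can be run directly against the files in cloud storage. BigQuery setup is analogous in spirit: a table whose schema mirrors the same JSON structure.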

This repo is split into: