Skip to content

Resources for querying Snowplow bad rows in structured tables

Notifications You must be signed in to change notification settings

dailymuse/snowplow-badrows-tables

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

snowplow-badrows-tables

Resources to enable SQL queries over Snowplow bad rows

Overview

A "bad row" is any payload that hits a Snowplow pipeline, but which cannot be successfully parsed or processed. Bad rows are preserved by writing them to cloud storage (e.g. S3 or GCS) as newline-delimited JSON objects.

Athena on AWS and BigQuery on GCP are tools that let you query your bad rows, using the cloud storage files as a back-end data source.

SELECT data.failure.messages FROM adatper_failures
WHERE data.failure.timestamp > timestamp '2020-04-01'

This is great for debugging your pipeline without the need to load your bad rows into a separate database.

Setup

To create tables in Athena or BigQuery, you need to provide a table definitions corresponding to the JSON schema of your bad row files. Each different bad row type (e.g. schema violations, adapter failures) has its own JSON schema, and therefore requires its own separate table.

This repo is split into:

About

Resources for querying Snowplow bad rows in structured tables

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published