Sparkify, a music streaming company, wants to automate their ETL pipeline using Apache Airflow.
This project builds a data pipeline that is optimized for reusable tasks, can be monitored, and allows for backfills. Data quality checks are run against the datasets after the ETL steps complete to assure data quality.
- Load data from S3 into an Amazon Redshift warehouse (a sketch of the staging COPY statement follows this list)
- Two S3 datasets are used:
  - Song data files: s3://udacity-dend/song_data
  - Event data, JSON log files of what users are listening to: s3://udacity-dend/log_data
- Perform data quality checks on the tables created in Redshift
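Under the hood, staging comes down to issuing Redshift COPY statements against the buckets above. Below is a minimal sketch of how such a statement might be templated in Python; the staging table name, credentials, and JSON format option are placeholders, not the project's actual values.

```python
# Hypothetical COPY template a StageToRedshiftOperator could render.
# Table name, credentials, and the JSON option are placeholders.
COPY_SQL = """
    COPY {table}
    FROM 's3://udacity-dend/{s3_prefix}'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    FORMAT AS JSON '{json_option}';
"""

staging_events_copy = COPY_SQL.format(
    table="staging_events",      # placeholder staging table
    s3_prefix="log_data",        # event data path from above
    access_key="<AWS_ACCESS_KEY>",
    secret_key="<AWS_SECRET_KEY>",
    json_option="auto",          # let Redshift map JSON keys to columns
)
```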
The project is organized into the following components:
- Dag: contains all tasks to be run
- Operators: operator templates (dedicated tasks)
- Helpers: SQL files with the different transformations (see the sketch after this list)
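As an illustration of what a helper might hold, the transformations can live as plain SQL strings in a Python module; the class name, file path, and column names below are assumptions rather than the project's actual schema.

```python
# plugins/helpers/sql_queries.py (hypothetical path and schema)
class SqlQueries:
    # Example transformation: derive songplay records by joining the
    # staged event and song data. Column names are illustrative only.
    songplay_table_insert = """
        SELECT
            events.ts        AS start_time,
            events.userid    AS user_id,
            events.level,
            songs.song_id,
            songs.artist_id,
            events.sessionid AS session_id
        FROM staging_events events
        LEFT JOIN staging_songs songs
            ON events.song = songs.title
           AND events.artist = songs.artist_name
        WHERE events.page = 'NextSong'
    """
```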
The following custom operators are created in Airflow:
- StageToRedshiftOperator: stages data from S3 into tables in Redshift
- LoadDimensionOperator: loads data into the dimension tables on Redshift
- LoadFactOperator: loads data into the fact table on Redshift
- DataQualityOperator: performs data quality checks (a minimal sketch follows this list)
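For example, here is a minimal sketch of the DataQualityOperator, assuming Airflow 2.x with the Postgres provider installed; the constructor parameters (redshift_conn_id, tables) and the row-count check are assumptions about how the checks are configured.

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Runs a simple row-count check against each listed Redshift table."""

    def __init__(self, redshift_conn_id="redshift", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} has no rows")
            self.log.info("Data quality check passed for table %s", table)
```

In the DAG, this task would be instantiated with the list of fact and dimension tables to verify after the load steps finish.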
Below is the Sparkify Airflow dependency graph
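In code, the dependency graph corresponds to task wiring roughly like the following; this is a placeholder sketch assuming Airflow 2.x, with EmptyOperator standing in for the custom operators and illustrative task ids.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# EmptyOperator placeholders keep the sketch runnable; in the real DAG
# these would be the custom operators described above.
with DAG(
    "sparkify_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="Begin_execution")
    stage_events = EmptyOperator(task_id="Stage_events")
    stage_songs = EmptyOperator(task_id="Stage_songs")
    load_songplays = EmptyOperator(task_id="Load_songplays_fact_table")
    load_dims = [
        EmptyOperator(task_id=f"Load_{name}_dim_table")
        for name in ("user", "song", "artist", "time")
    ]
    quality_checks = EmptyOperator(task_id="Run_data_quality_checks")
    end = EmptyOperator(task_id="Stop_execution")

    # Stage both datasets in parallel, load the fact table, fan out to the
    # dimension tables, then run the quality checks before finishing.
    start >> [stage_events, stage_songs] >> load_songplays
    load_songplays >> load_dims >> quality_checks >> end
```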