Project 3 Data Warehouse on AWS Redshift




Data Warehouses with AWS Redshift

Udacity Nanodegree Course Project 3
Explore the repository»

redshift, aws, data warehousing, data engineering, ETL

About The Project

A startup called Sparkify wants to analyze the data it has been collecting on songs and user activity on its new music streaming application. Sparkify has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the application, as well as a directory of JSON metadata on the songs in the application.

Sparkify would like a data engineer to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs users are listening to. The goal of this project is to create a data warehouse in the cloud (AWS Redshift) and build an ETL pipeline for this analysis.

Project Description

In this project, we build a data warehouse on AWS and an ETL pipeline for a database hosted on Redshift. The data is loaded from S3 buckets into staging tables on Redshift and modeled into fact and dimension tables to perform analytics and obtain meaningful insights.

Built With

  • python
  • AWS

Dataset

Song Dataset

The song dataset is a subset of the Million Song Dataset. Each file in the dataset is in JSON format and contains metadata about a song and the artist of that song. The dataset is hosted in the S3 bucket s3://udacity-dend/song_data.

Sample Record :

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The log dataset is generated by an event simulator. These JSON log files simulate activity logs from a music streaming application based on specified configurations. The dataset is hosted in the S3 bucket s3://udacity-dend/log_data.

Sample Record :

{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

Database Schema Design

Data Model ERD

The star schema (fact and dimension tables) is used for data modeling in this ETL pipeline. There is one fact table containing all the metrics (facts) associated with each event (user action), and four dimension tables containing associated information such as user name, artist name, song metadata, etc. This model allows the schema to be queried with the minimum number of SQL JOINs and enables fast read queries. The amount of data we need to analyze is not big enough to require big data solutions or NoSQL databases.

The data stored in the S3 buckets is extracted into the staging tables staging_events and staging_songs on Redshift. The data from these tables is then transformed and inserted into the fact and dimension tables. An entity relationship diagram (ERD) of the data model is given below.

(ERD image: database schema)
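
To make the staging-to-star-schema step concrete, below is a minimal sketch of the kind of queries kept in sql_queries.py: the songplays fact table and the INSERT ... SELECT that fills it by joining the two staging tables. The distribution and sort keys shown are illustrative assumptions rather than the repository's exact choices.

    # sql_queries.py (sketch) -- fact table and the staging-to-fact transformation.
    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL SORTKEY,
            user_id     VARCHAR NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR DISTKEY,
            artist_id   VARCHAR,
            session_id  INTEGER,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """

    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id,
                               artist_id, session_id, location, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
               e.userId, e.level, s.song_id, s.artist_id,
               e.sessionId, e.location, e.userAgent
        FROM staging_events e
        JOIN staging_songs  s
          ON e.song = s.title AND e.artist = s.artist_name
        WHERE e.page = 'NextSong';
    """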

Project structure

Files in this repository:

  • images: Folder at the root of the project, where images are stored
  • aws_cluster_create.py: Creates and sets up a Redshift cluster on AWS with the proper configuration (a boto3 sketch follows this list)
  • aws_cluster_destroy.py: Destroys the Redshift cluster on AWS, if it exists
  • sql_queries.py: Contains the SQL queries for staging, schema definition and ETL
  • create_tables.py: Drops and creates the tables on AWS Redshift (resets the tables)
  • etl.py: Stages the data from the S3 buckets, transforms it, and loads it into the tables
  • analyze.py: Runs basic queries against all the created tables to check their validity
  • dwh.cfg: Sample configuration file for AWS
  • README: This readme file
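
As a reference for what aws_cluster_create.py does, cluster creation essentially comes down to one boto3 call. The sketch below assumes the cluster settings (node type, number of nodes, identifier, credentials, IAM role) are read from dwh.cfg; the section and key names used here are illustrative, not necessarily the ones in the repository.

    # Sketch of the core of aws_cluster_create.py: read the cluster settings
    # from dwh.cfg and ask Redshift to create the cluster.
    import configparser
    import boto3

    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    redshift = boto3.client(
        "redshift",
        region_name="us-west-2",                      # assumed region
        aws_access_key_id=config.get("AWS", "KEY"),   # section/key names are assumptions
        aws_secret_access_key=config.get("AWS", "SECRET"),
    )

    response = redshift.create_cluster(
        ClusterType=config.get("CLUSTER", "CLUSTER_TYPE"),   # e.g. "multi-node"
        NodeType=config.get("CLUSTER", "NODE_TYPE"),         # e.g. "dc2.large"
        NumberOfNodes=int(config.get("CLUSTER", "NUM_NODES")),
        ClusterIdentifier=config.get("CLUSTER", "CLUSTER_IDENTIFIER"),
        DBName=config.get("CLUSTER", "DB_NAME"),
        MasterUsername=config.get("CLUSTER", "DB_USER"),
        MasterUserPassword=config.get("CLUSTER", "DB_PASSWORD"),
        IamRoles=[config.get("IAM_ROLE", "ARN")],            # role granting S3 read access
    )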

Getting Started

Clone the repository to your local machine using

git clone https://github.com/vineeths96/Data-Engineering-Nanodegree

Prerequisites

These are the prerequisites to run the program.

  • python 3.7
  • PostgreSQL
  • AWS account
  • psycopg2 python library
  • boto3 python library

How to run

Follow the steps to extract and load the data into the data model.

  1. Navigate to the Project 3 Data Warehouse on AWS folder

  2. Edit the dwh.cfg configuration file and fill in the AWS Access Key and Secret Key fields (a sketch of how this configuration is consumed follows these steps)

  3. Run aws_cluster_create.py to create the clusters on AWS by

    python aws_cluster_create.py

    The type, number of nodes, and other specifications of the cluster are taken from the configuration file. Wait until the cluster-creation confirmation is displayed.

  4. Run create_tables.py to create/reset the tables by

    python create_tables.py
  5. Run the ETL process and load the data into the database by

    python etl.py

    This executes the SQL queries that stage the data from S3 into Redshift and then transform it and insert it into the fact and dimension tables on Redshift.

  6. Run analyze.py to validate that the data was loaded into the tables by

    python analyze.py

    This runs sample queries against all the created tables.

  7. Run aws_cluster_destroy.py to destroy the cluster on AWS by

    python aws_cluster_destroy.py
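
As a companion to steps 2 and 5 above, the staging part of etl.py essentially reads dwh.cfg and issues Redshift COPY commands against the S3 buckets. The sketch below uses psycopg2 (listed as a prerequisite) and illustrative configuration section/key names; the repository's actual query strings live in sql_queries.py.

    # Sketch of the staging step in etl.py: connect with psycopg2 and COPY the
    # JSON data from S3 into a staging table. Config keys are assumptions.
    import configparser
    import psycopg2

    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    staging_events_copy = """
        COPY staging_events
        FROM '{}'
        IAM_ROLE '{}'
        FORMAT AS JSON '{}'
        REGION 'us-west-2';
    """.format(
        config.get("S3", "LOG_DATA"),      # s3://udacity-dend/log_data
        config.get("IAM_ROLE", "ARN"),
        config.get("S3", "LOG_JSONPATH"),  # JSONPaths file mapping log fields to columns
    )

    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()
    cur.execute(staging_events_copy)
    conn.commit()
    conn.close()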

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Vineeth S - [email protected]

Project Link: https://github.com/vineeths96/Data-Engineering-Nanodegree