Udacity Nanodegree Course Project 3
Explore the repository»
redshift, aws, data warehousing, data engineering, ETL
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. Sparkify has grown its user base and song database and wants to move its processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs of user activity on the application, as well as a directory with JSON metadata on the songs in their application.
They'd like a data engineer to build an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to. The goal of this project is to create a data warehouse on the cloud (AWS Redshift) and build an ETL pipeline for this analysis.
In this project, we build a data warehouse on AWS and an ETL pipeline for a database hosted on Redshift. The data is loaded from S3 buckets into staging tables on Redshift and modeled into fact and dimension tables to perform analytics and obtain meaningful insights.
- python
- AWS
The songs dataset is a subset of the Million Song Dataset. Each file in the dataset is in JSON format and contains metadata about a song and the artist of that song. The dataset is hosted in the S3 bucket s3://udacity-dend/song_data.
Sample record:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The logs dataset is generated by an event simulator. These JSON log files simulate activity logs from a music streaming application based on specified configurations. The dataset is hosted in the S3 bucket s3://udacity-dend/log_data.
Sample record:
{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}
The star database schema (fact and dimension tables) is used for data modeling in this ETL pipeline. There is one fact table containing all the metrics (facts) associated with each event (user action), and four dimension tables containing associated information such as user name, artist name, song metadata, etc. This model allows the schema to be queried with the minimum number of SQL JOINs and enables fast read queries. The amount of data we need to analyze is not big enough to require big data solutions or NoSQL databases.
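For illustration, here is a minimal sketch of what the fact table definition could look like on Redshift, in the style of sql_queries.py. The column names and the DISTKEY/SORTKEY choices are assumptions based on the description above, not the repository's exact DDL.

```python
# Hypothetical fact table DDL in the style of sql_queries.py; column names
# and the DISTKEY/SORTKEY choices are assumptions, not the repo's exact DDL.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
    start_time   TIMESTAMP NOT NULL SORTKEY,
    user_id      INT       NOT NULL,
    level        VARCHAR,
    song_id      VARCHAR   DISTKEY,
    artist_id    VARCHAR,
    session_id   INT,
    location     VARCHAR,
    user_agent   VARCHAR
);
"""
```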
The data stored in the S3 buckets is extracted into the staging tables staging_events and staging_songs on Redshift. The data from these tables is then transformed and inserted into the fact and dimension tables. An entity relationship diagram (ERD) of the data model is given below.
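As a rough sketch of those two load stages, again in the style of sql_queries.py (the COPY options, IAM role placeholder, JSONPaths file, and column lists here are assumptions rather than the repository's exact queries):

```python
# Hypothetical sketch: COPY raw JSON from S3 into a staging table, then fill
# the fact table with INSERT ... SELECT. The IAM role ARN and JSONPaths file
# are placeholders/assumptions.
staging_events_copy = """
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE '<your-iam-role-arn>'
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id,
                       artist_id, session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.userId, e.level, s.song_id, s.artist_id,
       e.sessionId, e.location, e.userAgent
FROM staging_events e
JOIN staging_songs s
  ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
"""
```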
Files in this repository:
File / Folder | Description |
---|---|
images | Folder at the root of the project, where images are stored |
aws_cluster_create.py | Creates and sets up a Redshift cluster on AWS with proper configurations |
aws_cluster_destroy.py | Destroys the Redshift cluster on AWS, if it exists |
sql_queries.py | Contains the SQL queries for staging, schema definition and ETL |
create_tables.py | Drops and creates tables on AWS Redshift (resets the tables) |
etl.py | Stages the data from the S3 buckets on Redshift and transforms and loads it into the analytics tables |
analyze.py | Runs basic queries on all created tables to check that the data was loaded correctly |
dwh.cfg | Sample configuration file for AWS |
README | Readme file |
Clone the repository to a local machine using
`git clone https://github.com/vineeths96/Data-Engineering-Nanodegree`
These are the prerequisites to run the program.
- python 3.7
- PostgreSQL
- AWS account
- psycopg2 python library
- boto3 python library
Follow the steps below to extract and load the data into the data model.

- Navigate to the `Project 3 Data Warehouse on AWS` folder.
- Edit the `dwh.cfg` configuration file and fill in the AWS Access Key and Secret Key fields.
- Run `aws_cluster_create.py` to create the cluster on AWS by running `python aws_cluster_create.py`. The type, number of nodes, and other specifications of the cluster are taken from the configuration file. Wait until the cluster creation confirmation is displayed. (A minimal boto3 sketch of the cluster setup and teardown is given after these steps.)
- Run `create_tables.py` to create/reset the tables by running `python create_tables.py`. (A sketch of this create/reset flow follows these steps.)
- Run the ETL process and load the data into the database by running `python etl.py`. This executes the SQL queries that stage the data from S3 on Redshift and then transform and insert it into the fact and dimension tables on Redshift.
- Run `analyze.py` to validate that data was loaded into the tables by running `python analyze.py`. This runs some sample queries against all the created tables. (A row-count sketch follows these steps.)
- Run `aws_cluster_destroy.py` to destroy the cluster on AWS by running `python aws_cluster_destroy.py`.
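For reference, a minimal boto3 sketch of the kind of calls aws_cluster_create.py and aws_cluster_destroy.py would make. The cluster identifier, node type, cluster size, and region below are assumptions; the actual scripts read these values from dwh.cfg and also handle IAM roles and networking.

```python
# Hypothetical sketch of Redshift cluster setup/teardown with boto3.
# Identifier, node type, size, and region are assumptions; the real scripts
# read them from dwh.cfg and attach an IAM role for S3 read access.
import boto3

redshift = boto3.client('redshift', region_name='us-west-2')

# Create the cluster (aws_cluster_create.py).
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    ClusterIdentifier='dwh-cluster',
    DBName='dwh',
    MasterUsername='dwh_user',
    MasterUserPassword='<your-password>',
)

# Block until the cluster is available (the "creation confirmation" step).
redshift.get_waiter('cluster_available').wait(ClusterIdentifier='dwh-cluster')

# Tear the cluster down (aws_cluster_destroy.py).
redshift.delete_cluster(ClusterIdentifier='dwh-cluster',
                        SkipFinalClusterSnapshot=True)
```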
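Similarly, the create/reset step amounts to reading dwh.cfg and executing the DROP/CREATE statements over a psycopg2 connection. A minimal sketch is below; the config section and key names are assumptions, and the query lists are assumed to be exported by sql_queries.py.

```python
# Hypothetical sketch of the create/reset flow (create_tables.py). The
# dwh.cfg section/key names and the names exported by sql_queries.py are
# assumptions and may differ in the repository.
import configparser
import psycopg2

from sql_queries import create_table_queries, drop_table_queries  # assumed interface

config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect(
    host=config['CLUSTER']['HOST'],
    dbname=config['CLUSTER']['DB_NAME'],
    user=config['CLUSTER']['DB_USER'],
    password=config['CLUSTER']['DB_PASSWORD'],
    port=config['CLUSTER']['DB_PORT'],
)
cur = conn.cursor()

# Drop everything first so the script can be re-run, then create the tables.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```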
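Finally, the validation step can be as simple as counting the rows in each table. A sketch of what analyze.py might run; the connection string values are placeholders and the table names are assumed from the star schema described above.

```python
# Hypothetical sketch of the validation queries (analyze.py): count rows in
# every staging, fact, and dimension table to confirm the load worked.
import psycopg2

conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=dwh user=dwh_user password=<password> port=5439"
)
cur = conn.cursor()

for table in ['staging_events', 'staging_songs', 'songplays',
              'users', 'songs', 'artists', 'time']:
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(f"{table}: {cur.fetchone()[0]} rows")

conn.close()
```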
Distributed under the MIT License. See LICENSE for more information.
Vineeth S - [email protected]
Project Link: https://github.com/vineeths96/Data-Engineering-Nanodegree