Udacity Nanodegree Course Project 2
apache, cassandra, nosql, data engineering, ETL, data modeling
A startup called Sparkify wants to analyze the data they have been collecting on songs and user activity on their new music streaming application. The analytics team is particularly interested in understanding which songs users are listening to. Currently, they have no easy way to query their data, which resides in a directory of JSON logs of user activity on the application, along with a directory of JSON metadata on the songs in the application.
They'd like a data engineer to create an Apache Cassandra database that can answer queries on song play data and yield meaningful insights. The goal of this project is to create a database schema and an ETL pipeline for this analysis.
In this project, we model the data with Apache Cassandra and build an ETL pipeline in Python. The pipeline merges a directory of CSV files into a single streamlined CSV file, then uses that file to insert data into Apache Cassandra tables. We create a separate denormalized table for each target query, choosing partition keys and clustering columns to match the query's access pattern.
- Python
- Apache Cassandra
- IPython notebooks
The event dataset is a collection of CSV files containing user activity over a period of time. Each file records the songs played, user information, and other attributes.

Available data columns:

`artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userId`
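The merge into `event_datafile_new.csv` can be sketched as below. This is a minimal illustration, not the notebook's exact code: the function name, the chosen column subset, and the rule of dropping rows with an empty `artist` (non-song-play events) are assumptions.

```python
import csv
import glob
import os

def merge_event_csvs(data_dir, out_path, columns):
    """Merge every CSV under data_dir into one streamlined CSV keeping
    only the listed columns, dropping rows with no artist (assumed to
    be non-song-play events such as page navigation)."""
    with open(out_path, "w", newline="", encoding="utf8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        for path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
            with open(path, newline="", encoding="utf8") as f:
                for row in csv.DictReader(f):
                    if row.get("artist"):
                        writer.writerow([row.get(c, "") for c in columns])

# Example (illustrative column subset):
# merge_event_csvs("event_data", "event_datafile_new.csv",
#                  ["artist", "firstName", "gender", "itemInSession",
#                   "lastName", "length", "level", "location",
#                   "sessionId", "song", "userId"])
```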
The keyspace design is shown in the image below. Each table is modeled to answer one specific, known query. This query-first model keeps reads fast even when the schema holds huge amounts of data; relational databases are not suitable in this scenario due to the magnitude of the data.
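As a hedged sketch of this query-first modeling (the table and column names here are illustrative assumptions, not necessarily those in the notebook), a table answering "what was played at item Y of session X" puts the session id in the partition key, so all rows of a session live in one partition, and uses the item number as a clustering column to order rows within it:

```python
# CQL DDL for a query-first table. The partition key (session_id)
# determines data placement; the clustering column (item_in_session)
# sorts rows inside each partition.
CREATE_SESSION_SONGS = """
CREATE TABLE IF NOT EXISTS session_songs (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    length float,
    PRIMARY KEY (session_id, item_in_session)
)
"""

# The matching query filters on the full primary key, so Cassandra can
# serve it from a single partition without scanning.
SELECT_SESSION_SONG = """
SELECT artist, song, length
FROM session_songs
WHERE session_id = %s AND item_in_session = %s
"""
```

Both statements would be executed through a connected `cassandra-driver` session, e.g. `session.execute(CREATE_SESSION_SONGS)`.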
Files in this repository:
| File / Folder | Description |
|---|---|
| event_data | Folder at the root of the project, where all user activity CSVs reside |
| images | Folder at the root of the project, where images are stored |
| event_datafile_new.csv | Contains the merged data from the CSV files in event_data |
| Project 2.ipynb | IPython notebook containing the ETL pipeline: data extraction, modeling, and loading into the keyspace tables |
| README | Readme file |
Clone the repository to your local machine using

```shell
git clone https://github.com/vineeths96/Data-Engineering-Nanodegree
```
These are the prerequisites to run the project:

- Python 3.7
- Apache Cassandra
- `cassandra-driver` Python library
Follow these steps to extract and load the data into the data model:

- Navigate to the `Project 2 Data Modeling with Apache Cassandra` folder
- Run the `Project 2.ipynb` IPython notebook
- Run Part 1 to create `event_datafile_new.csv`
- Run Part 2 to initiate the ETL process and load data into the tables
- Check whether the data has been loaded into the database by executing `SELECT` queries
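The final verification step can be sketched as below. The three query shapes mirror the project's three analysis questions; the table names, column names, and sample parameters are assumptions for illustration, not necessarily those used in the notebook.

```python
# Illustrative SELECT statements, one per modeled table, to confirm
# that data landed in the keyspace. Names are assumed for this sketch.
VERIFY_QUERIES = {
    "songs_in_session": (
        "SELECT artist, song, length FROM session_songs "
        "WHERE session_id = %s AND item_in_session = %s"
    ),
    "songs_by_user_session": (
        "SELECT artist, song, first_name, last_name FROM user_session_songs "
        "WHERE user_id = %s AND session_id = %s"
    ),
    "listeners_of_song": (
        "SELECT first_name, last_name FROM song_listeners "
        "WHERE song = %s"
    ),
}

def run_verification(session, params):
    """Execute each verification query through a connected
    cassandra-driver Session, printing the returned rows."""
    for name, cql in VERIFY_QUERIES.items():
        for row in session.execute(cql, params[name]):
            print(name, row)
```

A connected session (e.g. `Cluster(["127.0.0.1"]).connect("sparkify")`, keyspace name assumed) and a dict of sample parameters would be passed to `run_verification`; any printed rows confirm the load succeeded.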
Distributed under the MIT License. See `LICENSE` for more information.
Vineeth S - [email protected]
Project Link: https://github.com/vineeths96/Data-Engineering-Nanodegree