MSDS: Data Management Final Project

This is the GitHub repo for Team 12's Final Project for the class Special Topics in Data Science, a.k.a. the Database Management class. The final project brings together concepts from the class, including how to set up relational and NoSQL datastores, query them, and configure indices to optimize performance. The app demonstrates processing, analyzing, and searching a Twitter feed.

Stack

Backend

The stack is composed of a backend and a frontend. The backend data stores are as follows (a short connection sketch appears after the list):

Redis: an in-memory cache used to optimize certain queries

MongoDB: a document-based database used to store the actual tweets. While the test data is given a priori, the app demonstrates ingesting it one record at a time to simulate the streaming nature of the data.

Postgres: used for data that changes infrequently, such as user/profile records
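To make the division of labor concrete, here is a minimal sketch of how a notebook or the app might connect to the three stores. The package names (redis, pymongo, psycopg2), the database/collection/user names, and the STACK_PASSWORD environment variable are assumptions for illustration; the hostnames and ports come from the table under "Accessing the Application and Services".

```python
# Minimal connection sketch (assumes the Docker hostnames/ports from the
# services table and the password entered when running run.sh).
import os

import redis                      # pip install redis
from pymongo import MongoClient   # pip install pymongo
import psycopg2                   # pip install psycopg2-binary

PASSWORD = os.environ.get("STACK_PASSWORD", "changeme")  # hypothetical env var

# Redis: in-memory cache for hot query results
cache = redis.Redis(host="redisdb", port=25379,
                    password=PASSWORD, decode_responses=True)

# MongoDB: document store for the tweets themselves
mongo = MongoClient("mongodb", 25017, username="root", password=PASSWORD)
tweets = mongo["twitter"]["tweets"]          # hypothetical db/collection names

# Postgres: relational store for slowly changing user/profile data
pg = psycopg2.connect(host="pgdb", port=25432, user="postgres",
                      password=PASSWORD, dbname="postgres")

print(cache.ping(), tweets.estimated_document_count(), pg.status)
```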

Frontend & Search App

The right option here is still being discussed and debated. The "low tech" solution would be ipywidgets in a Jupyter notebook, and that is the "simple" fallback. We will also consider a Pythonic framework like Django, which supports routing (endpoints) and web templates for a simple, mostly static web page. The HTML front end will call the Python middleware, which in turn interacts with the data stores.
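If Django ends up being the choice, the routing piece could look roughly like the sketch below. The view name, template name, and the search_tweets helper are invented for illustration and are not part of this repo.

```python
# Hypothetical Django routing/view sketch for the search front end.
from django.urls import path
from django.shortcuts import render

def search(request):
    """Render a simple search page; the Python middleware would query
    MongoDB/Redis here before handing results to the template."""
    query = request.GET.get("q", "")
    results = []  # e.g. results = search_tweets(query)  (hypothetical helper)
    return render(request, "search.html", {"query": query, "results": results})

urlpatterns = [
    path("search/", search, name="search"),
]
```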

Other Scripts

The repo also includes some notebooks for analyzing the data and will include a Python script for processing the input data one record at a time, once again simulating streaming data. To work with these notebooks, copy the *.ipynb files from the notebooks subdirectory to the input in the runtime directory, as specified in the run.sh script.
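As a rough illustration of the one-record-at-a-time approach, the ingestion script might look something like this sketch. The input file name and format (one JSON tweet per line), the database/collection names, and the Redis counter key are assumptions, not the script shipped in this repo.

```python
# Sketch of simulated streaming ingestion: read tweets one record at a time,
# insert each into MongoDB, and keep a running counter in the Redis cache.
import json
import time

import redis
from pymongo import MongoClient

cache = redis.Redis(host="redisdb", port=25379)
tweets = MongoClient("mongodb", 25017)["twitter"]["tweets"]

with open("tweets.jsonl") as fh:            # one JSON tweet per line (assumed)
    for line in fh:
        tweet = json.loads(line)
        tweets.insert_one(tweet)            # ingest a single record
        cache.incr("ingested_count")        # cheap progress counter in the cache
        time.sleep(0.1)                     # throttle to mimic a live feed
```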

Beyond the run script, the repo will likely include build scripts for creating images with the database and other dependencies.

Accessing the Application and Services

All apps and services in this stack use the port range 25xxx, where xxx differentiates among the services. To run the stack, simply cd into services and execute ./run.sh. Follow the prompts, which ask for a password for the data stores and a runtime directory where the app data, database data, and notebooks will live.

To access the programmatic interface via the integrated Jupyter service, type http://<hostname>:25888 into the browser and JupyterHub should come up. The notebooks will be in the /work directory.

The Docker hostname and ports for the respective services are as follows:

| Service  | Host       | Port  | Purpose                                                    |
|----------|------------|-------|------------------------------------------------------------|
| Jupyter  | jupyterhub | 25888 | Programmatic access to data/app via Jupyter notebook       |
| Postgres | pgdb       | 25432 | Store static/structured data like users/locations          |
| MongoDB  | mongodb    | 25017 | Store tweets and other dynamic data with flexible schema   |
| Redis    | redisdb    | 25379 | Cache important data during ingestion as well as querying  |

Each service is containerized, and the respective CLI tools for each database can be accessed by executing docker exec -it <container_name> bash. Once you see the bash prompt, you can follow the instructions on how to access the CLI tools for each of the respective DBs. Please use the password you passed in to the run script.
