Xitric/DataScienceDocker

Exam project for the Data Science course at SDU. We use Apache Spark and HBase to analyze big data from the city of San Francisco.

Files

All the files necessary to use this project are available on OneDrive.

How to get started

To set up this repository on your computer, do the following:

  1. Clone the repository to your computer
  2. Download the prepared volumes from OneDrive, and extract the zip file
  3. From the root of the project, run import/import.cmd
    • Use the location of the extracted volumes as input, such as C:/Users/Name/Desktop/Volumes
  4. Start the cluster using start.cmd
    • To stop the cluster, we recommend using stop.cmd to avoid corrupting data in HBase
  5. Wait for HDFS to exit safemode and for HBase to initialize. This may take a few minutes (see the polling sketch after this list)
  6. You can now view different visualizations on localhost
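
If you would rather script step 5 than watch the logs, a minimal sketch of polling HDFS for its safemode status is shown below. It assumes the NameNode web UI is exposed on localhost:50070 (the Hadoop 2.x default; Hadoop 3.x uses 9870), so adjust the port to match the containers in this cluster.

```python
import time
import requests

# Assumption: the NameNode web UI is reachable on localhost:50070.
# The NameNodeInfo JMX bean exposes a "Safemode" field, which is an
# empty string once safemode is off.
JMX_URL = "http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"

def wait_for_safemode_exit(poll_seconds=10):
    """Block until HDFS reports that safemode is off."""
    while True:
        try:
            bean = requests.get(JMX_URL, timeout=5).json()["beans"][0]
            if bean.get("Safemode", "") == "":
                print("HDFS has left safemode")
                return
            print(bean["Safemode"])
        except requests.RequestException:
            print("NameNode not reachable yet")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for_safemode_exit()
```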

Executing jobs in the cluster

  1. Go to the admin panel
  2. Ensure that all files necessary for the job are uploaded under "Upload" with the type "Spark application"
    • If using the volumes provided on OneDrive, this has already been done
    • If running jobs from scratch, upload the driver for each Python job to HDFS as a .py file
    • The code for all Python files must be uploaded to the same directory as a single zip archive named files
    • The jar libraries on which the Spark applications depend must also be uploaded; these jars are available on OneDrive
  3. Under "Submit Spark application", write the name of the job to execute, such as incident_aggregator, and press "Submit"
  4. The status of the job is most easily tracked in Livy or with the "Spark job status" panel on the admin page; a scripted alternative is sketched below
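
The admin panel essentially wraps Livy's REST API, so jobs can also be submitted and tracked directly. Below is a minimal sketch of doing so in Python; the Livy port (8998, the default) and all HDFS paths are assumptions, so substitute the paths from your own upload step.

```python
import time
import requests

# Assumption: Livy listens on localhost:8998. The HDFS paths below are
# hypothetical placeholders, not this project's actual layout.
LIVY_URL = "http://localhost:8998"

batch = {
    "file": "hdfs:///jobs/incident_aggregator.py",  # driver uploaded as a .py file
    "pyFiles": ["hdfs:///jobs/files.zip"],          # zip archive with the shared code
    "jars": ["hdfs:///jobs/shc-core.jar"],          # dependency jars from OneDrive
    "name": "incident_aggregator",
}

# Submit the batch job and remember its id.
resp = requests.post(f"{LIVY_URL}/batches", json=batch).json()
batch_id = resp["id"]

# Poll Livy until the job leaves its running states.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
    print(f"Batch {batch_id}: {state}")
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)
```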

Building images locally

For your convenience, all images have been uploaded to Docker Hub.

If, for some reason, you wish to build the images yourself, you must first download the SHC connector from OneDrive and place it under pysparkApp/. This file is too large for GitHub, and our public fork does not permit the use of LFS.
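
For context on why the connector is needed: SHC (the Spark-HBase Connector) is what lets the PySpark jobs read HBase tables as DataFrames. Below is a minimal sketch of a read through SHC, assuming the jar is on the Spark classpath; the table name and column mapping are hypothetical placeholders, not this project's actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-example").getOrCreate()

# Hypothetical catalog: describes how an HBase table maps onto DataFrame
# columns. "incidents" and the column family "d" are placeholders.
catalog = """{
    "table": {"namespace": "default", "name": "incidents"},
    "rowkey": "key",
    "columns": {
        "id":       {"cf": "rowkey", "col": "key",      "type": "string"},
        "category": {"cf": "d",      "col": "category", "type": "string"},
        "count":    {"cf": "d",      "col": "count",    "type": "int"}
    }
}"""

# Read the HBase table as a DataFrame through the SHC data source.
df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())
df.show()
```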
