rsilvery/mapr_yelp_tutorial

MapR Yelp Tutorial

The Zeppelin notebook used in the MapR Python blog (insert link later)

Goal: Explore the Yelp Open Dataset and plot the probability of receiving a particular rating using Matplotlib, PySpark, Spark SQL, and MapR-DB. This tutorial assumes you've already uploaded the JSON dataset from here to your distributed file system and untarred it into the /user/mapr/ directory.
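
To make the goal concrete, here is a minimal plain-Python sketch of the quantity being plotted: the share of reviews that received each star rating. It is not the notebook's code (the notebook works through PySpark and Spark SQL). The sample records are hypothetical, and the sketch assumes the review file is line-delimited JSON with a numeric `stars` field.

```python
import json
from collections import Counter

# Hypothetical sample records; the layout (one JSON object per line with a
# "stars" field) is an assumption about the review file.
SAMPLE_REVIEWS = """\
{"review_id": "r1", "stars": 5}
{"review_id": "r2", "stars": 4}
{"review_id": "r3", "stars": 5}
{"review_id": "r4", "stars": 1}
"""


def rating_probabilities(lines):
    """Return {rating: share of reviews with that rating}."""
    counts = Counter(json.loads(line)["stars"] for line in lines if line.strip())
    total = float(sum(counts.values()))  # float() keeps this correct on Python 2
    return {rating: count / total for rating, count in sorted(counts.items())}


probabilities = rating_probabilities(SAMPLE_REVIEWS.splitlines())
print(probabilities)
```

Each value is an empirical probability, so the values sum to 1 and can be passed directly to a bar chart.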

Step 1: Create a Python environment and store it to MapR-FS

Detailed steps for doing this with Conda can be found here. The overall process is:

  1. Create a Python environment with Pandas and MatPlotLib:

    conda create -p mapr_yelp_tutorial/ python=2 pandas matplotlib
    
  2. Zip this directory up from inside the directory:

    cd mapr_yelp_tutorial/
    zip -r mapr_yelp_tutorial.zip ./
    
  3. Store the archive in MapR-FS:

    hadoop fs -put mapr_yelp_tutorial.zip /user/mapr/python_envs/
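
Zipping from inside the directory determines the paths stored in the archive: entries are recorded relative to the environment root (for example `bin/python`) rather than under a `mapr_yelp_tutorial/` prefix. The sketch below illustrates that layout with Python's standard `zipfile` module. The stand-in environment with a single placeholder file is hypothetical, and unlike `zip -r` this sketch stores regular files only.

```python
import os
import tempfile
import zipfile


def zip_from_inside(env_dir, archive_path):
    """Zip env_dir with entry names relative to env_dir itself."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(env_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # relpath drops the env_dir prefix, as zipping from inside does
                archive.write(full_path, os.path.relpath(full_path, env_dir))


# Hypothetical stand-in for the Conda environment: one file at bin/python.
work_dir = tempfile.mkdtemp()
env_dir = os.path.join(work_dir, "mapr_yelp_tutorial")
os.makedirs(os.path.join(env_dir, "bin"))
with open(os.path.join(env_dir, "bin", "python"), "w") as handle:
    handle.write("placeholder\n")

# The archive is written outside env_dir so it cannot include itself.
archive_path = os.path.join(work_dir, "mapr_yelp_tutorial.zip")
zip_from_inside(env_dir, archive_path)

with zipfile.ZipFile(archive_path) as archive:
    entry_names = archive.namelist()
print(entry_names)
```

Listing the entries of the real archive the same way is a quick check that it was built from inside the environment directory.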
    

Step 2: Load the MapR Data Science Refinery with the Python archive created in Step 1

  1. Set the following variable either in the docker run command or in the environment variable file you're using:

    ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip
    
  2. Log in to Zeppelin on the specified host and port

  3. Download our demo notebook from this repo and import it into Zeppelin
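
If you set the variable through an environment variable file, a quick sanity check before starting the container can catch a mistyped path. The following is a rough sketch that parses `KEY=VALUE` lines and confirms that `ZEPPELIN_ARCHIVE_PYTHON` is an absolute path ending in `.zip`. The file contents and the `parse_env_file` helper are hypothetical; only the variable name and its value come from this tutorial.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key] = value
    return settings


# Hypothetical env file contents; a real file would hold your other settings too.
ENV_TEXT = """\
# settings passed to the container
ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip
"""

settings = parse_env_file(ENV_TEXT)
archive = settings.get("ZEPPELIN_ARCHIVE_PYTHON", "")
archive_ok = archive.startswith("/") and archive.endswith(".zip")
print(archive, archive_ok)
```

This checks only the form of the value; it does not confirm that the archive exists in MapR-FS.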

About

The Zeppelin notebook used in the MapR blog, Modern Python & PySpark Application Development
