MapR Yelp Tutorial
The Zeppelin notebook used in the MapR Python blog (insert link later)
Goal: Peruse the Yelp Open Dataset and plot the probability of receiving a particular rating using MatPlotLib, PySpark, SparkSQL, and MapR-DB. This tutorial assumes you’ve already uploaded the JSON dataset from here to your distributed file system and untarred it into the /user/mapr/ directory.
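To make the goal concrete, here is a minimal plain-Python sketch of the quantity being plotted: the share of reviews that received each star rating. It is not the notebook's code (the notebook uses PySpark and SparkSQL). It assumes one JSON object per line with a numeric `stars` field, and the sample records are made up for illustration.

```python
from __future__ import division, print_function

import json
from collections import Counter


def rating_probabilities(lines):
    """Return {stars: fraction of reviews} from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["stars"]] += 1  # "stars" field name is an assumption
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stars: n / total for stars, n in counts.items()}


# Made-up sample standing in for the real review file.
sample = [
    '{"stars": 5}',
    '{"stars": 4}',
    '{"stars": 5}',
    '{"stars": 1}',
]
probs = rating_probabilities(sample)
for stars in sorted(probs):
    print(stars, probs[stars])
```

On the real dataset the same count-then-divide step would run as a Spark aggregation rather than a Python loop, with only the small result table handed to MatPlotLib.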
Step 1: Create a Python environment and store it to MapR-FS
Detailed steps for doing this with Conda can be found here. The overall process is:
-
Create a Python environment with Pandas and MatPlotLib:
conda create -p mapr_yelp_tutorial/ python=2 pandas matplotlib
-
Zip this directory up from inside the directory:
cd mapr_yelp_tutorial/
zip -r mapr_yelp_tutorial.zip ./
-
Store the archive in MapR-FS:
hadoop fs -put mapr_yelp_tutorial.zip /user/mapr/python_envs/
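Zipping from inside the directory means the archive's entries start at the environment's own contents (bin/, lib/, and so on) rather than under a mapr_yelp_tutorial/ prefix. The sketch below reproduces that layout with Python's standard library on a throwaway directory and lists the top-level entries, as a sanity check you could adapt before uploading. The helper names and the placeholder file are hypothetical, and it stands in for the `zip` command rather than replacing it.

```python
from __future__ import print_function

import os
import shutil
import tempfile
import zipfile


def zip_env_contents(env_dir, archive_base):
    """Zip the contents of env_dir so entries have no parent-directory prefix.

    archive_base should live outside env_dir so the archive does not
    try to include itself.
    """
    return shutil.make_archive(archive_base, "zip", root_dir=env_dir)


def top_level_entries(archive_path):
    """Return the sorted set of first path components inside the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})


# Build a tiny fake environment (hypothetical stand-in for the conda env).
work = tempfile.mkdtemp()
env_dir = os.path.join(work, "mapr_yelp_tutorial")
os.makedirs(os.path.join(env_dir, "bin"))
with open(os.path.join(env_dir, "bin", "python"), "w") as fh:
    fh.write("placeholder\n")

archive = zip_env_contents(env_dir, os.path.join(work, "mapr_yelp_tutorial"))
entries = top_level_entries(archive)
print(entries)

shutil.rmtree(work)  # clean up the throwaway directory
```

If the listing for a real archive shows a single mapr_yelp_tutorial entry at the top level, the zip was most likely created from the parent directory instead of from inside the environment.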
Step 2: Load the MapR Data Science Refinery and specify the Python archive created earlier in the Docker run command or environment variable file:
-
Set the following variable either in the Docker run command or in the environment variable file you’re using:
ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip
-
Log in to Zeppelin on the specified host and port
-
Download our demo notebook from this repo and import it into Zeppelin
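Before running the notebook, it can help to confirm that the interpreter Zeppelin launched can see the packages from the archive. A minimal sketch, assuming you paste it into a Python or PySpark paragraph; the helper name `module_versions` is hypothetical, and the check only reports what the current interpreter can import.

```python
from __future__ import print_function

import importlib
import sys


def module_versions(names):
    """Map each module name to its version string, or None if the import fails."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
            continue
        found[name] = getattr(module, "__version__", "unknown")
    return found


print("interpreter:", sys.executable)
report = module_versions(["pandas", "matplotlib"])
for name in sorted(report):
    print(name, report[name])
```

A `None` next to pandas or matplotlib suggests the paragraph is not running inside the archived environment, in which case the ZEPPELIN_ARCHIVE_PYTHON setting is the first thing to recheck.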