Notebooks
Jupyter
Jupyter is a Python-based notebook application that supports multiple language kernels; here it is used with Scala + Spark. Jupyter is installed via pip. To install, run the following commands:
# May be required for underlying C code
sudo apt-get install build-essential python-dev
pip install jupyter
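To confirm the installation, check the version (a quick sanity check):
jupyter --version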
To include Spark, Toree (currently in the Apache Incubator) needs to be installed:
sudo pip install toree
sudo jupyter toree install
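To verify that the kernel was registered, list the installed kernelspecs:
jupyter kernelspec list
# a toree (or apache_toree_scala) entry should appear alongside python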
Then the kernel.json needs to be adjusted in /path/to/jupyter/kernels/toree (e.g. /usr/local/share/jupyter/kernels/toree). Set SPARK_HOME, SPARK_OPTS, and similar environment variables there; most of the kernel's Spark configuration happens in this file.
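As an illustrative sketch (values are placeholders and the exact layout varies by Toree version, so treat this as an assumption rather than a reference), the relevant part of kernel.json might look like:
{
  "display_name": "Apache Toree - Scala",
  "language": "scala",
  "env": {
    "SPARK_HOME": "/path/to/spark",
    "SPARK_OPTS": "--master=local[*]"
  }
}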
Finally, run the following command to start the server on localhost:8888:
jupyter notebook
From there you can select Toree from the New menu to create a new Spark notebook. By default, this is a Scala notebook.
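As a quick sanity check, a first cell could run a small Spark job. This is a minimal sketch, assuming the Toree kernel predefines the SparkContext as sc (its default behaviour):
// Distribute a local range and sum it on the workers
val rdd = sc.parallelize(1 to 100)
rdd.sum() // expected result: 5050.0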
Straightforward to install for Python. Additional languages require other kernels, which makes setup a bit more complex. Using Toree for Spark seems fairly easy, but further testing is required for better insight. Jupyter has the biggest user base by far, which is a big plus.
Spark Notebook
Spark Notebook is a notebook developed specifically for Spark. Its main usage is with Scala; other languages (Python, R, ...) are to follow.
The easiest way to install this notebook is to go to the website, configure the required version, and download it. Then run the following commands:
tar -xzf spark-notebook-<CONFIG>.tar.gz spark-notebook
cd spark-notebook # Required for relative conf paths in application
bin/spark-notebook # This starts the notebook on localhost:9000
All configuration is done in spark-notebook/conf.
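For illustration, a tweak to conf/application.conf might look like the sketch below. The key names are assumptions for illustration only; consult the application.conf shipped in that directory for the actual settings:
# conf/application.conf -- illustrative sketch, key names are assumed
manager {
  notebooks {
    dir = ./my-notebooks   # directory where notebook files are stored
  }
}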
This seems to be the easiest to install. It comes with Spark/Hadoop/Hive/Parquet/... support out of the box, as well as cluster support. It is well maintained and in active development. Spark Notebook seems to be the best fit for this use case, as it addresses our need exactly. (This, however, could become a problem in the future if the need changes.)
View running instance?