Home
Welcome to the spark-tk wiki!
To get spark-tk off the ground, we have to establish its dependencies.
Get Spark from github (https://github.com/apache/spark) or use a CDH install. spark-tk likes Spark 1.6 (actually, right now it likes Spark 1.5, but it will like 1.6 very soon).
From the root spark-tk folder, try to build (without running the tests):
mvn clean install -DskipTests
You should see a sparktk-core/target/core-1.0-SNAPSHOT.jar, as well as a bunch of jars in sparktk-core/target/dependencies.
Now set $SPARKTK_HOME to the location of the jars:
export SPARKTK_HOME=$PWD/sparktk-core/target
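If you want a quick sanity check that the location is right, you can look for the jars from Python; this is just an illustrative check using the paths named above, nothing sparktk-specific:
>>> import os, glob
>>> sparktk_home = os.environ.get("SPARKTK_HOME")  # should be .../sparktk-core/target
>>> glob.glob(os.path.join(sparktk_home, "core-*.jar"))  # expect core-1.0-SNAPSHOT.jar
>>> len(glob.glob(os.path.join(sparktk_home, "dependencies", "*.jar")))  # the dependency jars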
(If you're only interested in the Scala API, you can skip the Python setup that follows.)
The python sparktk library is in spark-tk/python/sparktk.
It has a few dependencies that you may not have; look in spark-tk/python/requirements.txt to see what it needs.
Do pyspark first. Usually pyspark is sitting in your Spark installation. There are a couple of options: add the path to pyspark to $PYTHONPATH, or create a symlink to pyspark and put it in your site-packages folder. Something like
sudo ln -s /opt/cloudera/parcels/CDH/lib/spark/python/pyspark /usr/lib/python2.7/site-packages/pyspark
For your other dependencies, use pip2.7 to install.
pip2.7 install decorator
or pip2.7 install -r /path/to/spark-tk/python/requirements.txt
(Note: ideally you should use the same py4j that pyspark is using)
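If you're unsure which py4j that is, Spark ships its bundled copy under python/lib of the Spark installation; here is a hedged way to look it up (the default path below is just the CDH layout used earlier on this page):
>>> import os, glob
>>> spark_home = os.environ.get("SPARK_HOME", "/opt/cloudera/parcels/CDH/lib/spark")
>>> glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*"))  # pyspark's bundled py4j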
If you start up your Python interpreter from the spark-tk/python folder, you'll be fine. Otherwise, sparktk needs to be on the $PYTHONPATH or symlinked as shown above. Here it is from the spark-tk root folder:
sudo ln -s $PWD/python/sparktk /usr/lib/python2.7/site-packages/sparktk
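A quick way to verify that both imports resolve, from any directory (if either fails, revisit the $PYTHONPATH or symlink steps above):
>>> import pyspark
>>> import sparktk
>>> pyspark.__file__  # should point into your Spark installation
>>> sparktk.__file__  # should point into spark-tk/python/sparktk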
A quick way to see if things are happy is to build the code and run the tests
mvn install
To manually kick off the regression tests, cd to integration-tests
and run runtests.sh
(See the spark-tk/integration-tests/README.md for more info)
To manually run the python unit tests, cd to python/sparktk/tests
and run runtests.sh
Scala docs are built with
mvn scala:doc
(output found in spark-tk/sparktk-core/target/site/scaladocs)
For the Python docs, see the spark-tk/python/sparktk/doc/README.md
The sparktk library requires a SparkContext at runtime to interact with Spark. To that end, there is a class called TkContext
which provides the basic entry point to the sparktk library and holds the SparkContext. So we need to create a TkContext and either give it a SparkContext or tell it how to create one.
>>> import sparktk
>>> tc = sparktk.TkContext() # passing no parameters, this creates a SparkContext based on default config
Note: only one SparkContext can exist per session ("Cannot run multiple SparkContexts at once" is Spark's rule, enforced by pyspark)
The tc object exposes the library functionality for frames and models, etc. See the Example in the spark-tk README.md for a basic look at using the TkContext.
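If you already have a SparkContext, you can hand it to the TkContext instead of letting it create one, per the description above. This is a sketch; that TkContext takes the SparkContext as its first argument is an assumption here:
>>> from pyspark import SparkConf, SparkContext
>>> sc = SparkContext(conf=SparkConf().setMaster("local[2]").setAppName("sparktk-example"))
>>> tc = sparktk.TkContext(sc)  # assumption: TkContext accepts an existing SparkContext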