Switching from Analytics Toolkit to spark-tk Library
To support TAP analytics for Spark users, the TAP project has released the spark-tk Library, a new analytics toolkit in a Spark library format. The spark-tk Library provides the analytics functions previously available in the TAP Analytics Toolkit (which were accessed via a REST service), and supports machine learning by helping you ensure your data is clean before you use it with machine learning algorithms. spark-tk also includes a Box-Cox transformation and additional time series features, and introduces beta support for DICOM images. See the list of supported algorithms and operations by group here.
Both the Analytics Toolkit and the spark-tk Library are written in Python, making it relatively easy to convert applications using Analytics Toolkit API calls to spark-tk API calls. Examples below show some representative code changes that are needed, given the differences between the two implementations. Compare the Python calls in your existing code with the spark-tk Library Python calls to determine what modifications (usually minor, if any) will be needed.
Note: The Analytics Toolkit is being deprecated with the TAP 0.7.3 release and will not be available with future TAP releases.
Let’s look at the initialization differences in a basic frames example using the Analytics Toolkit and the spark-tk Library. Both examples are in Jupyter notebooks; the Analytics Toolkit version comes first, followed by the spark-tk version. Comparing the two, you can see that initialization requires different but equivalent steps.
# First, let's verify that the ATK client libraries are installed
import trustedanalytics as ta
print "ATK installation path = %s" % (ta.__path__)
# Next, look-up your ATK server URI from the TAP Console and enter the information below.
# This setting will be needed in every ATK notebook so that the client knows what server to communicate with.
# E.g. ta.server.uri = 'demo-atk-c07d8047.demotrustedanalytics.com'
ta.server.uri = 'ENTER URI HERE'
# This notebook assumes you have already created a credentials file.
# Enter the path here to connect to ATK
ta.connect('myuser-cred.creds')
# First, let's verify that the SparkTK libraries are installed
import sparktk
print "SparkTK installation path = %s" % (sparktk.__path__)
SparkTK installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']
from sparktk import TkContext
tc = TkContext()
The SparkContext created by TkContext follows the system's current Spark configuration. If your system defaults to HDFS, but you want to use a local file system instead, include use_local_fs=True when creating your TkContext, as follows:
tc = sparktk.TkContext(use_local_fs=True)
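During a migration it can help to check which of the two client libraries a notebook environment actually has before running the initialization steps above. The following is a minimal sketch, not part of either toolkit: the helper name find_installed_toolkits is hypothetical, and it assumes only the package names sparktk and trustedanalytics used in the import statements above.

```python
import importlib


def find_installed_toolkits(names=("sparktk", "trustedanalytics")):
    """Return a dict mapping each importable package name to its install path.

    Hypothetical helper for illustration; the default names are the two
    packages imported in the examples above.
    """
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed in this environment; skip it.
            continue
        # Packages expose __path__; fall back to None for plain modules.
        found[name] = getattr(module, "__path__", None)
    return found


installed = find_installed_toolkits()
for name in sorted(installed):
    print("%s installation path = %s" % (name, installed[name]))
```

If neither package is importable, the dictionary is empty and nothing is printed.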
The Analytics Toolkit denotes integer and floating point types as ta.int32, ta.int64, ta.float32, and ta.float64. With the spark-tk Library, these simply become int and float, respectively. See the following integer example.
# Analytics Toolkit
schema = [('letter', str),
          ('number', ta.int64)]
# spark-tk Library
schema = [('letter', str),
          ('number', int)]
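When an application defines many schemas, the type substitution described above can be applied mechanically. The sketch below shows one way to do that, under stated assumptions: convert_schema and ATK_TO_PYTHON are hypothetical names, and the ATK types are matched by their type name (int32, int64, float32, float64). Whether the real ta types report those names through __name__ is an assumption you should verify. Stand-in classes take the place of ta.int64 and ta.float32 so that the example runs without trustedanalytics installed.

```python
# Map ATK numeric type names to the plain Python types used with spark-tk.
ATK_TO_PYTHON = {
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
}


def convert_schema(atk_schema):
    """Return a copy of an ATK-style schema with numeric types replaced.

    Types are matched by their __name__; anything not in ATK_TO_PYTHON
    (for example str) is passed through unchanged.
    """
    converted = []
    for column_name, column_type in atk_schema:
        type_name = getattr(column_type, "__name__", None)
        converted.append((column_name, ATK_TO_PYTHON.get(type_name, column_type)))
    return converted


# Stand-ins for ta.int64 and ta.float32 (assumption: illustration only).
class int64(int):
    pass


class float32(float):
    pass


atk_schema = [("letter", str), ("number", int64), ("score", float32)]
schema = convert_schema(atk_schema)
```

Matching on the type name rather than the type object keeps the helper free of a trustedanalytics import, which matters once that package is no longer available.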
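Models are trained differently in the Analytics Toolkit and the spark-tk Library. The examples below come from an LDA example in each. With the Analytics Toolkit, you create a model object and then call train on it; with the spark-tk Library, you call train through the TkContext, and the call returns the trained model.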
# Analytics Toolkit: create the model object, then train it.
model = ta.LdaModel()
# LDA model is trained using the frame above.
results = model.train(frame, 'doc_id', 'word_id', 'word_count',
                      max_iterations=3, num_topics=2)
# spark-tk Library: train through the TkContext, which returns the model.
# LDA model is trained using the frame above.
model = tc.models.clustering.lda.train(frame, 'doc_id', 'word_id', 'word_count',
                                       max_iterations=3, num_topics=2)
print model.report
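If an application trains LDA models in many places, one option is to route them through a single function so that the call style changes in only one place. This is a sketch rather than a spark-tk feature: train_lda is a hypothetical helper, and the only spark-tk call it makes is tc.models.clustering.lda.train with the arguments shown in the example above.

```python
def train_lda(tc, frame, max_iterations=3, num_topics=2):
    """Train an LDA model through a TkContext and return the model.

    Hypothetical wrapper: the column names and default values are copied
    from the example above and would need adjusting for other frames.
    """
    return tc.models.clustering.lda.train(
        frame, 'doc_id', 'word_id', 'word_count',
        max_iterations=max_iterations, num_topics=num_topics)
```

A call such as model = train_lda(tc, frame) then takes the place of the two-step ta.LdaModel() and model.train(...) sequence.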
These are the kinds of changes that may be needed when switching from the Analytics Toolkit to the spark-tk Library.
For issues you may encounter when switching to spark-tk, go here.