

Daniel Smith edited this page Jan 12, 2017 · 11 revisions

Switching from the Analytics Toolkit to the new spark-tk Library

To support TAP analytics for Spark users, the TAP project has released the spark-tk Library, a new analytics toolkit packaged as a Spark library. The spark-tk Library provides the analytics functions previously available in the TAP Analytics Toolkit (which were accessed via a REST service), and supports machine learning by helping you ensure your data is clean for use with machine learning algorithms. spark-tk adds a Box Cox transformation and additional time series features, and introduces beta support for DICOM images. See the list of supported algorithms and operations by group here.

Both the Analytics Toolkit and the spark-tk Library are written in Python, so converting an application from Analytics Toolkit API calls to spark-tk API calls is relatively easy. The examples below show representative code changes required by the differences between the two implementations. Compare the Python calls in your existing code with the corresponding spark-tk Library calls to determine what modifications (usually minor, if any) are needed.
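As a quick reference, the correspondences illustrated in the sections below can be summarized in plain Python. The mapping is drawn from the examples on this page; the dictionary itself is just an illustrative summary, not part of either API:

```python
# Illustrative summary of the API correspondences shown on this page.
# Neither side is executed here; the strings simply name the calls.
atk_to_sparktk = {
    "ta.connect('myuser-cred.creds')": "tc = sparktk.TkContext()",
    "ta.int32 / ta.int64": "int",
    "ta.float32 / ta.float64": "float",
    "ta.LdaModel().train(frame, ...)": "tc.models.clustering.lda.train(frame, ...)",
}

for old_call, new_call in sorted(atk_to_sparktk.items()):
    print("%-34s -> %s" % (old_call, new_call))
```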

Note: The Analytics Toolkit is being deprecated with the TAP 0.7.3 release and will not be available with future TAP releases.

Initialization

Let’s look at the initialization differences in a basic frames example using the Analytics Toolkit and the spark-tk Library. Both examples are in Jupyter notebooks. Comparing the two versions, you can see that initialization requires different, but equivalent steps.

Initialization in Analytics Toolkit

# First, let's verify that the ATK client libraries are installed
import trustedanalytics as ta
print "ATK installation path = %s" % (ta.__path__)


# Next, look-up your ATK server URI from the TAP Console and enter the information below.
# This setting will be needed in every ATK notebook so that the client knows what server to communicate with.
# E.g. ta.server.uri = 'demo-atk-c07d8047.demotrustedanalytics.com'
ta.server.uri = 'ENTER URI HERE'


# This notebook assumes you have already created a credentials file.
# Enter the path here to connect to ATK
ta.connect('myuser-cred.creds')

Initialization in spark-tk

# First, let's verify that the SparkTK libraries are installed
import sparktk
print "SparkTK installation path = %s" % (sparktk.__path__)


SparkTK installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']


from sparktk import TkContext
tc = TkContext()

Note that the SparkContext created by TkContext follows the system's current Spark configuration. If your system defaults to HDFS, but you want to use a local file system instead (for example, for exporting models), include use_local_fs=True when creating your TkContext, as follows:

  tc = sparktk.TkContext(use_local_fs=True)  

Numbers

Integers and floating point numbers used with the Analytics Toolkit are denoted ta.int32, ta.int64, ta.float32, and ta.float64. With the spark-tk Library, these simply become int and float, respectively. See the following integer example.

Integer in Analytics Toolkit

schema = [ ('letter', str),
           ('number', ta.int64) ]

Integer in spark-tk

schema = [ ('letter', str),
           ('number', int) ]
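The floating point types convert the same way. Here is a minimal sketch of a spark-tk schema mixing string, integer, and float columns; the 'ratio' column name is a hypothetical addition for illustration, not taken from the original examples:

```python
# Hypothetical schema: ta.int64 maps to int, ta.float64 maps to float.
# The 'ratio' column is an invented example column.
schema = [('letter', str),
          ('number', int),
          ('ratio', float)]
print(schema)
```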

Model training

Model training differs between the Analytics Toolkit and the spark-tk Library. The examples below come from an LDA example in the Analytics Toolkit and the spark-tk Library.

Model training in Analytics Toolkit

model = ta.LdaModel()

# LDA model is trained using the frame above.
results = model.train(frame, 'doc_id', 'word_id', 'word_count', 
                      max_iterations = 3, num_topics = 2)

Model training in spark-tk Library

# LDA model is trained using the frame above.
model = tc.models.clustering.lda.train(frame, 'doc_id', 'word_id', 'word_count', 
                      max_iterations = 3, num_topics = 2)
print model.report

These are the kinds of changes that may be needed when switching from the Analytics Toolkit to the spark-tk Library.

Troubleshooting

For issues you may encounter when switching to spark-tk, go here.
