Skip to content

Benchmarks

jcanny edited this page Feb 21, 2015 · 185 revisions

Table of Contents

Overview (Updated 2/20/15)

We recently ran some fresh benchmarks for Spark v1.1 and v1.2 and Graphlab clusters, and included some updated numbers from other recent published benchmarks.

The algorithms and datasets are listed below.

  • Logistic Regression for text classification (RCV1)
  • Logistic Regression for click prediction (Criteo)
  • K-means (MNIST 8M digit recognition)
  • Matrix Factorization (Netflix dataset)
  • Topic Modeling (NYTimes dataset)
  • Random Forests (Year Prediction dataset)
  • PageRank on social graphs (Twitter Followers)
  • PageRank on a web graph (AltaVista crawl)

Logistic Regression

Reuters Data

RCV1-v2 (Reuters news data, LYR2004 distribution) benchmarks are for OAA (One Against All) classification, since RCV1-v2 has 103 independent topic labels. RCV1-v2 is a small dataset (0.5 GB).

BIDMach was run on a single Amazon g2.xlarge instance, while Spark was run on a cluster of m3.2xlarge high-memory instances. The other systems were run on an 8-code Intel E-2660 system. All the systems that had support for Intel MKL were compiled with MKL support linked in. Energy and cost are estimates only based on a basic CPU (m3.2xlarge) or GPU (g2.2xlarge) EC2 instance.

Logistic Regression (sparse) on RCV1

System Nodes/Cores nclasses AUC Time Cost Energy(KJ)
Spark 18/72 1 0.93 30s $0.07 120
BIDMach 1 103 0.96 14s $0.002 3
VW 1/8 103 0.96 130s $0.02 30
Scikit-Learn 1/8 103 0.96 576s $0.08 120
Liblinear 1/8 103 0.96 250s $0.04 60

VW = Vowpal Wabbit

Accuracy (AUC - higher is better) is quoted only for topic 6, which is roughly balanced in +/- instances. The accuracy across all classes is somewhat higher, but tracks the topic 6 accuracy.

All the systems were run as OAA classifiers (multiclass) except Spark and Liblinear. Neither of these systems supports multi-label classification directly. Liblinear was trained separately for each topic.

Spark accuracy was lower than the other systems after 3 passes over the dataset, perhaps due to Spark's SGD implementation, which does only a single SGD update for each pass over the dataset. The other systems make one update per minibatch.

VW has a fast SGD implementation and uses hardware acceleration (Intel intrinsics) and is probably the fastest single-machine system in use today. See e.g.

http://fastml.com/vowpal-wabbit-liblinear-sbm-and-streamsvm-compared/

BIDMach with GPU is an order of magnitude faster than any of the single-machine systems, and outperforms Spark on a 28-node cluster while training 103x as many models.

The reference for RCV1-v2 is:

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.

Criteo Dataset

Criteo released a medium-sized (12 GB) dataset with a single target (click, no click) with a very sparse set of features. This is representative of many click prediction tasks in industry.

System Nodes/Cores npasses AUC Time Cost Energy(KJ)
Spark 8/32 10 0.62 964s $0.64 1500
Spark 32/128 10 0.62 400s $1.00 2500
BIDMach 1 1 0.66 81s $0.01 6
BIDMach 1 10 0.72 805s $0.13 60

BIDMach was run as before on a single Amazon g2.xlarge instance, while Spark 1.1 was run on a cluster of m3.2xlarge high-memory instances. Once again Spark SGD did not seem to perform well, and running more iterations did not improve it. Random guessing yields an AUC of 0.5, so BIDMach's score (0.72) represents about twice the lift of Spark (0.62). BIDMach was around 10x faster on a per-node basis. Single-target logistic regression is a worst-case for GPU acceleration, but BIDMach's numbers were quite respectable on this problem. Overall speed was higher for comparable accuracy, and cost and energy use were several orders of magnitude lower.

The dataset is available here:

http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/

K-Means

MNIST-8M dataset

The MNIST dataset is a medium-sized (25 GB) collection of images of handwritten digits. The dataset has 8 million 28x28 images labeled 0-9. K-means was run for 20 iterations.

System nodes/cores nclust SumSq Time Cost Energy(KJ)
Spark 32/128 256 1.5e13 180s $0.45 1150
BIDMach 1 256 1.44e13 220s $0.04 60
Scikit-Learn 1/8 256 3200x4 * $1.0 10
Spark 96/384 4096 1.05e13 1100 $9.00 22000
BIDMach 1 4096 0.995e13 735 $0.12 140

BIDMach was competitive with a small Spark cluster (32 nodes) on small problems (dim 256) but outperformed a ~100-node cluster on a large problem (dim 4096). This is typical for compute-intensive problems where BIDMach gets full leverage from GPU matrix kernels.

The reference for the dataset is:

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.

Matrix Factorization

Full Netflix Dataset

The dataset is a small (about 4GB) collection of movie reviews from 0.5 million users on 17770 movies. Sparse matrix factorization was used, either a batch version (also known as Alternating Least Squares) and an SGD implementation. We compared against Spark 1.1 and the archived graphlab open source Distribution. The new commercial version of graphlab doesnt yet support distributed computing.

BIDMach was run on a g2.2xlarge instance, Spark on an m3.2xlarge cluster, and Graphlab on an cc2.8xlarge cluster (the default in the graphlab EC2 script).

System Nodes/cores Dim RMSE Time Cost Energy (KJ)
Graphlab 18/576 100 376 $3.50 10,000
Spark 32/128 100 0.82 146 $0.40 1000
BIDMach 1 100 0.83 90 $0.015 20
Spark 32/128 200 0.82 544 $1.45 3500
BIDMach 1 200 0.83 129 $0.02 30
BIDMach 1 500 0.83 600 $0.10 150

Latent Dirichlet Allocation (LDA)

UCI NY Times dataset

BIDMach includes several implementations of LDA: An online Variational Bayes (VB) [3], a batch variational Bayes [1], and a "cooled" Gibbs sampler implementation which is described in a forthcoming paper [6]. We compared against 4 systems, which are all custom LDA implementations:

  1. David's Blei's original C++ implementation of VB [1]
  2. Thomas Hoffmann's implementation of Collapsed Gibbs Sampling in Matlab/C [2]
  3. Yan et al.'s implementation of Collapsed Gibbs Sampling with GPU acceleration [4]
  4. Ahmed et al.'s implementation of VB using a customized cluster implementation [5]
To the best of our knowledge, 3. was the fastest single-machine implementation of LDA other than BIDMach, and 4. was the fastest cluster implementation. 4. has been heavily optimized for LDA, and should be significantly faster than other cluster implementations. All systems were trained to convergence for that algorithm. Not all reached the same level of perplexity, and cooled GS reached the highest perplexity score [6]. The very different numbers of iterations to convergence were due to the use of parameter cooling as described in [6].
System Time Iterations Data Throughput Gflops Mflops/W
BIDMach online VB (680 GPU) 40 secs 2 40 MB/s 25 250
BIDMach Cooled GS (680 GPU) 90 secs 3 20 MB/s 30 300
BIDMach batch VB (680 GPU) 400 secs 20 50 MB/s 25 250
Yan GPU Collapsed GS 5400 secs 1000 100 MB/s 0.4 (est.) 4
Blei batch VB 252000 secs 20 0.1 MB/s 0.05 (est.) 0.5
Griffiths Collapsed GS 225000 secs 1000 2 MB/s 8 Mflops (est.) 0.08

To compare with the Yahoo_LDA cluster algorithm [5] which used proprietary news articles, we constructed a repeating stream of NYTimes articles as per [5]. There were two problem instances, first for 200 million articles, 256 topics:

System Time Iterations Data Throughput Gflops Mflops/W
BIDMach Cooled GS (1 node) 30,000 secs 2 10 MB/s 30 300
Yahoo_LDA (100 nodes) 120,000 secs 1000 1300 MB/s 8 (est.) 0.8

The second for 2 billion articles, 256 topics:

System Time Iterations Data Throughput Gflops Mflops/W
BIDMach Cooled GS (1 node) 300,000 secs 2 10 MB/s 30 300
Yahoo_LDA (1000 nodes) 240,000 secs 1000 7000 MB/s 40 (est.) 0.4

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.

[2] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.

[3] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent dirichlet allocation. In NIPS, volume 2, page 5, 2010.

[4] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134–2142, 2009.

[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 123–132. ACM, 2012.

[6] H. Zhao, and J. Canny, Cooled Gibbs Parameter Estimation, (under review) 2014.

Random Forests

For this test, we used the UCI Year Prediction dataset that was used in a recent Spark v1.2 benchmark:

https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html (about 0.2 GB)

We ran Forests with 100 trees at depth 10 (you really need depth ~ 20 to get good accuracy, but we dont have those times for Spark) to match the benchmark reported in that blog. Spark was run on 16 EC2 workers, which were r3.2xlarge. BIDMach was run on a single g2.2xlarge.

Here are the times:

System Time(s) Cost($) Energy(KJ)
Spark-16W 150 0.50 500
BIDMach 300 0.06 60

Its fair to ask whether the comparison is meaningful since Spark is a cluster implementation and in theory can do much larger datasets. There are other RF implementations that are faster per-machine (Wise-IO and the most recent Scikit-Learn implementation is very good). We think the comparison is appropriate because BIDMach's RF implementation is scalable (unlike the other single-machine systems). Its an out-of-memory implementation that pulls the data repeatedly off disk. We've computed Forests on 100 GB datasets, and it should be possible to go at least an order of magnitude beyond that, still on one machine.

With Forests of course, one has perfectly linear scaling with the number of trees. BIDMach's time above was based on blocks of 25 trees, and so one can reduce time by x4 by running on 4 nodes still for the same cost and energy. Our implementation focused on scalability on a single node, squeezing memory use down so that we can get competitive running time but avoid the overhead of distributing the calculation of single trees. We think the result is very competitive in production environments with other forest implementations, in fact we're testing this right now :-)

PageRank

It turns out that most data frameworks (Hadoop, Spark, Graphlab) use a reduce operation which isnt scalable [2]. In these systems, messages are exchanged all-to-all in a network with N nodes, and message size shrinks as O(1/N) (fixed amount of data per node) or O(1/N^2) (fixed total dataset size). Eventually messages become too small to be sent efficiently. Butterfly Allreduce primitives overcome this problem [1], [2], and provide benefits of error-tolerance and speed.

The table below shows the performance of BIDMach running on a cluster of 64 nodes, for one iteration of the PageRank algorithm, using the "Kylix" allreduce primitive [2]. The first benchmark is for the Twitter Followers graph, a Power-law graph with 40 million vertices and 1.4 billion edges.

System Time
BIDMach/Kylix (64 nodes) 0.5 secs
PowerGraph (64 nodes) 3.6 secs
Hadoop (90 nodes) 250

The second benchmark is for the Yahoo Altavista Web crawl, a Power-law graph with 1.4 billion vertices and 6 billion edges.

System Time
BIDMach/Kylix (64 nodes) 2.5 secs
PowerGraph (64 nodes) 7 secs
Hadoop (90 nodes) 1000

BIDMach with Kylix is 3-7x faster than the next-fastest system for this problem, PowerGraph, on a 64-node cluster. This gap should grow on larger clusters thanks to the better asymptotic performance of butterflys.

[1] Huasha Zhao and John Canny, Butterfly Mixing: Accelerating Incremental-Update Algorithms on Clusters, Proc. 2013 SIAM International Conference on Data Mining (SDM 2013)

[2] H. Zhao and J. Canny, Kylix: A Sparse Allreduce for Commodity Clusters, in Int. Conf. on Parallel Processing 2014.

(1) Allreduce can be implemented in the MapReduce Framework but requires two reduce stages. Our approach provides allreduce as a single-stage primitive.

Clone this wiki locally