
Benchmarks



Overview (Updated 2/9/15)

We recently ran fresh benchmarks against Spark v1.1 and GraphLab clusters, and included updated numbers from other recently published benchmarks.

The algorithms and datasets are listed below.

  • Logistic Regression for text classification (RCV1)
  • Logistic Regression for click prediction (Criteo)
  • K-means (MNIST 8M digit recognition)
  • Matrix Factorization (Netflix dataset)
  • Topic Modeling (NYTimes dataset)
  • Random Forests (Year Prediction dataset)
  • PageRank on social graphs (Twitter Followers)
  • PageRank on a web graph (AltaVista crawl)

Logistic Regression

Reuters Data

RCV1-v2 (Reuters news data, LYRL2004 distribution) benchmarks are for OAA (One Against All) classification, since RCV1-v2 has 103 independent topic labels. RCV1-v2 is a small dataset (0.5 GB).

BIDMach was run on a single Amazon g2.2xlarge instance, while Spark was run on a cluster of m3.2xlarge high-memory instances. The other systems were run on an 8-core Intel E5-2660 machine. All systems with support for Intel MKL were compiled with MKL linked in. Energy and cost are estimates only, based on a basic CPU (m3.2xlarge) or GPU (g2.2xlarge) EC2 instance.

Logistic Regression (sparse) on RCV1

| System | Nodes/Cores | nclasses | AUC | Time | Cost | Energy (KJ) |
|---|---|---|---|---|---|---|
| Spark | 18/72 | 1 | 0.93 | 30s | $0.07 | 120 |
| BIDMach | 1 | 103 | 0.96 | 14s | $0.002 | 3 |
| VW | 1/8 | 103 | 0.96 | 130s | $0.02 | 30 |
| Scikit-Learn | 1/8 | 103 | 0.96 | 576s | $0.08 | 120 |
| Liblinear | 1/8 | 103 | 0.96 | 250s | $0.04 | 60 |

VW = Vowpal Wabbit

Accuracy (AUC - higher is better) is quoted only for topic 6, which is roughly balanced in +/- instances. The accuracy across all classes is somewhat higher, but tracks the topic 6 accuracy.

All the systems were run as OAA classifiers (multiclass) except Spark and Liblinear. Neither of these systems supports multi-label classification directly. Liblinear was trained separately for each topic.

Spark's accuracy was lower than the other systems' after 3 passes over the dataset, perhaps because Spark's SGD implementation makes only a single gradient update per pass over the dataset, while the other systems make one update per minibatch.
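
To make the distinction concrete, here is a minimal numpy sketch (not BIDMach's GPU code; the function names and defaults are ours) of OAA logistic regression trained with one gradient update per minibatch, plus the rank-based AUC used to score a single topic such as topic 6 (ties ignored):

```python
import numpy as np

def oaa_logreg_sgd(X, Y, n_passes=3, batch_size=1000, lr=0.5):
    # X: (n, d) feature matrix (dense or scipy.sparse CSR), Y: (n, c) 0/1 label
    # matrix with one column per topic (c = 103 for RCV1-v2).
    n, d = X.shape
    c = Y.shape[1]
    W = np.zeros((d, c))                            # one binary model per topic
    for _ in range(n_passes):
        for s in range(0, n, batch_size):
            xb, yb = X[s:s + batch_size], Y[s:s + batch_size]
            p = 1.0 / (1.0 + np.exp(-(xb @ W)))     # predicted topic probabilities
            grad = xb.T @ (p - yb) / xb.shape[0]    # mean logistic-loss gradient
            W -= lr * grad                          # one update per minibatch
    return W

def auc(scores, labels):
    # Rank-based (Mann-Whitney) AUC for a single topic.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```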

VW has a fast SGD implementation, uses hardware acceleration (Intel intrinsics), and is probably the fastest single-machine CPU system in use today; see e.g. http://fastml.com/vowpal-wabbit-liblinear-sbm-and-streamsvm-compared/. BIDMach with a GPU is an order of magnitude faster than any of the other single-machine systems, and outperforms the Spark cluster above while training 103x as many models.

The reference for RCV1-v2 is:

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.

Criteo Dataset

Criteo released a medium-sized (12 GB) dataset with a single target (click / no click) and a very sparse set of features. This is representative of many click-prediction tasks in industry.

| System | Nodes/Cores | npasses | AUC | Time | Cost | Energy (KJ) |
|---|---|---|---|---|---|---|
| Spark | 8/32 | 10 | 0.62 | 964s | $0.64 | 1500 |
| Spark | 32/128 | 10 | 0.62 | 400s | $1.00 | 2500 |
| BIDMach | 1 | 1 | 0.66 | 81s | $0.01 | 6 |
| BIDMach | 1 | 10 | 0.72 | 805s | $0.13 | 60 |

BIDMach was run as before on a single Amazon g2.2xlarge instance, while Spark 1.1 was run on a cluster of m3.2xlarge high-memory instances. Once again Spark's SGD did not perform well, and running more iterations did not improve it. Random guessing yields an AUC of 0.5, so BIDMach's score (0.72) represents about twice the lift of Spark's (0.62). BIDMach was around 10x faster on a per-node basis. Single-target logistic regression is a worst case for GPU acceleration, but BIDMach's numbers were quite respectable on this problem.
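
In terms of lift over random guessing (AUC minus 0.5), BIDMach's lift is 0.72 - 0.5 = 0.22 versus 0.62 - 0.5 = 0.12 for Spark, a ratio of roughly 1.8x.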

The dataset is available here:

http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/

K-Means

MNIST-8M dataset

The MNIST-8M dataset is a medium-sized (25 GB) collection of images of handwritten digits: 8 million 28x28 images labeled 0-9. K-means was run for 20 iterations.

| System | Nodes/Cores | nclusters | SumSq | Time | Cost | Energy (KJ) |
|---|---|---|---|---|---|---|
| Spark | 32/128 | 256 | 1.5e13 | 180s | $0.45 | 1150 |
| BIDMach | 1 | 256 | 1.44e13 | 220s | $0.04 | 60 |
| Spark | 96/384 | 4096 | 1.05e13 | 1100s | $9.00 | 22000 |
| BIDMach | 1 | 4096 | 0.995e13 | 735s | $0.12 | 140 |
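
For reference, the "SumSq" column is the within-cluster sum of squared distances. The sketch below (plain numpy on the CPU, with illustrative names and defaults; BIDMach's GPU kernels and Spark MLlib differ in implementation) shows minibatch K-means and how that objective is computed:

```python
import numpy as np

def minibatch_kmeans(X, k, n_passes=20, batch_size=4096, seed=0):
    # X: (n, d) array of samples, e.g. 784-pixel MNIST digits.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(np.float64)
    counts = np.zeros(k)

    def sqdist(batch):
        # squared distances via ||x||^2 - 2 x.c + ||c||^2 (avoids a huge 3-D array)
        return ((batch ** 2).sum(1)[:, None]
                - 2.0 * batch @ centroids.T
                + (centroids ** 2).sum(1)[None, :])

    for _ in range(n_passes):
        for s in range(0, len(X), batch_size):
            batch = X[s:s + batch_size].astype(np.float64)
            assign = sqdist(batch).argmin(axis=1)      # nearest centroid per point
            for j, x in zip(assign, batch):            # running-mean centroid update
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]

    # "SumSq": within-cluster sum of squared distances over the whole dataset
    sumsq = 0.0
    for s in range(0, len(X), batch_size):
        batch = X[s:s + batch_size].astype(np.float64)
        sumsq += sqdist(batch).min(axis=1).sum()
    return centroids, sumsq
```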

The reference for the dataset is:

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.

Matrix Factorization

Full Netflix Dataset

The dataset is a small (about 4 GB) collection of movie ratings from roughly 0.5 million users on 17,770 movies. We used sparse matrix factorization, either a batch version (also known as Alternating Least Squares) or an SGD implementation. We compared against Spark 1.1 and the archived GraphLab open-source distribution; the new commercial version of GraphLab does not yet support distributed computing.

BIDMach was run on a g2.2xlarge instance, Spark on an m3.2xlarge cluster, and GraphLab on a cc2.8xlarge cluster (the default in the GraphLab EC2 script).

| System | Nodes/Cores | Dim | RMSE | Time | Cost | Energy (KJ) |
|---|---|---|---|---|---|---|
| Graphlab | 18/576 | 100 | n/a | 376s | $3.50 | 10,000 |
| Spark | 32/128 | 100 | 0.82 | 146s | $0.40 | 1000 |
| BIDMach | 1 | 100 | 0.83 | 90s | $0.015 | 20 |
| Spark | 32/128 | 200 | 0.82 | 544s | $1.45 | 3500 |
| BIDMach | 1 | 200 | 0.83 | 129s | $0.02 | 30 |
| BIDMach | 1 | 500 | 0.83 | 600s | $0.10 | 150 |
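
As a rough illustration of the SGD variant (the batch/ALS variant instead alternates least-squares solves for the user and movie factors), here is a toy CPU sketch with illustrative names and hyperparameters; "Dim" in the table is the latent dimension, and the training RMSE is computed as below:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, dim=100, n_passes=10, lr=0.01, reg=0.05, seed=0):
    # ratings: list of (user, item, rating) triples, e.g. the Netflix data.
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, dim))   # user factors
    V = 0.1 * rng.standard_normal((n_items, dim))   # movie factors
    for _ in range(n_passes):
        rng.shuffle(ratings)                        # visit ratings in random order
        for u, i, r in ratings:
            err = r - U[u] @ V[i]                   # prediction error for this rating
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])  # regularized SGD step
            V[i] += lr * (err * u_old - reg * V[i])
    # training RMSE, as reported in the table above
    sq = sum((r - U[u] @ V[i]) ** 2 for u, i, r in ratings)
    return U, V, np.sqrt(sq / len(ratings))
```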

Latent Dirichlet Allocation (LDA)

UCI NY Times dataset

BIDMach includes several implementations of LDA: an online Variational Bayes (VB) algorithm [3], a batch VB algorithm [1], and a "cooled" Gibbs sampler described in a forthcoming paper [6]. We compared against four other systems, all of which are custom LDA implementations:

  1. David Blei's original C++ implementation of VB [1]
  2. Tom Griffiths and Mark Steyvers' implementation of Collapsed Gibbs Sampling in Matlab/C [2]
  3. Yan et al.'s implementation of Collapsed Gibbs Sampling with GPU acceleration [4]
  4. Ahmed et al.'s implementation of VB using a customized cluster implementation [5]
To the best of our knowledge, 3. was the fastest single-machine implementation of LDA other than BIDMach, and 4. was the fastest cluster implementation; 4. has been heavily optimized for LDA and should be significantly faster than other cluster implementations. Each system was trained to convergence for its algorithm. Not all reached the same perplexity; cooled GS achieved the best perplexity score [6]. The very different numbers of iterations to convergence are due to the use of parameter cooling, as described in [6].

| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
|---|---|---|---|---|---|
| BIDMach online VB (680 GPU) | 40 secs | 2 | 40 MB/s | 25 | 250 |
| BIDMach Cooled GS (680 GPU) | 90 secs | 3 | 20 MB/s | 30 | 300 |
| BIDMach batch VB (680 GPU) | 400 secs | 20 | 50 MB/s | 25 | 250 |
| Yan GPU Collapsed GS | 5400 secs | 1000 | 100 MB/s | 0.4 (est.) | 4 |
| Blei batch VB | 252000 secs | 20 | 0.1 MB/s | 0.05 (est.) | 0.5 |
| Griffiths Collapsed GS | 225000 secs | 1000 | 2 MB/s | 0.008 (est.) | 0.08 |
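
For orientation, the sketch below shows a minimal single-threaded collapsed Gibbs sampler of the kind used in [2] and [4] (function name and hyperparameters are ours). BIDMach's online VB and cooled GS samplers use different update rules, but maintain analogous document-topic and topic-word statistics:

```python
import numpy as np

def lda_collapsed_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V).
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # topic totals
    z = []                          # topic assignment per token
    for d, doc in enumerate(docs):  # random initialization
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove current token
                # conditional distribution over topics for this token
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())             # resample its topic
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```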

To compare with the Yahoo_LDA cluster algorithm [5], which used proprietary news articles, we constructed a repeating stream of NYTimes articles as in [5]. There were two problem instances; the first was 200 million articles with 256 topics:

| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
|---|---|---|---|---|---|
| BIDMach Cooled GS (1 node) | 30,000 secs | 2 | 10 MB/s | 30 | 300 |
| Yahoo_LDA (100 nodes) | 120,000 secs | 1000 | 1300 MB/s | 8 (est.) | 0.8 |

The second was 2 billion articles with 256 topics:

| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
|---|---|---|---|---|---|
| BIDMach Cooled GS (1 node) | 300,000 secs | 2 | 10 MB/s | 30 | 300 |
| Yahoo_LDA (1000 nodes) | 240,000 secs | 1000 | 7000 MB/s | 40 (est.) | 0.4 |

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.

[2] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.

[3] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent dirichlet allocation. In NIPS, volume 2, page 5, 2010.

[4] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134–2142, 2009.

[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 123–132. ACM, 2012.

[6] H. Zhao and J. Canny. Cooled Gibbs Parameter Estimation. Under review, 2014.

Random Forests

PageRank

The benchmarks so far have been on problems that involve "small" models in the megabyte range. Many other problems are naturally "graph" problems that don't decompose into minibatches. These problems typically decompose into a Map operation and an Allreduce (a special kind of reduce in which the reduced data is redistributed to all nodes (1)). Most data frameworks (Hadoop, Spark, GraphLab) use a direct reduce operation, which isn't scalable [2]. In these systems, messages are exchanged all-to-all in a network with N nodes, and message size shrinks as O(1/N) (fixed amount of data per node) or O(1/N^2) (fixed total dataset size). Eventually messages become too small to be sent efficiently. Butterfly Allreduce primitives overcome this problem [1], [2], and also provide benefits in fault tolerance and speed.
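
To make the Map + Allreduce decomposition concrete, here is a single-process numpy/scipy sketch of PageRank power iteration (names are illustrative). The comments mark the steps that become the per-node "map" and the cluster-wide allreduce that Kylix implements:

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(adj, n_iters=10, damping=0.85):
    # adj: (n, n) sparse adjacency matrix, adj[i, j] = 1 if page i links to page j.
    n = adj.shape[0]
    out_deg = np.asarray(adj.sum(axis=1)).ravel()
    out_deg[out_deg == 0] = 1.0                    # avoid division by zero for sink pages
    r = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        # "Map": scatter each vertex's rank along its out-edges.  On a cluster, each
        # node does this for its own edge partition, producing a partial rank vector.
        partial = adj.T @ (r / out_deg)
        # "Allreduce": the partial vectors are summed and the result redistributed to
        # every node (the step Kylix performs); with one partition it is a no-op.
        r = (1 - damping) / n + damping * partial
    return r

# tiny usage example: a 3-page cycle 0 -> 1 -> 2 -> 0
adj = csr_matrix(np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float))
print(pagerank(adj))   # all ranks equal by symmetry
```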

The table below shows the performance of BIDMach running on a cluster of 64 nodes for one iteration of the PageRank algorithm, using the "Kylix" allreduce primitive [2]. The first benchmark is the Twitter Followers graph, a power-law graph with 40 million vertices and 1.4 billion edges.

| System | Time |
|---|---|
| BIDMach/Kylix (64 nodes) | 0.5 secs |
| PowerGraph (64 nodes) | 3.6 secs |
| Hadoop (90 nodes) | 250 secs |

The second benchmark is the Yahoo AltaVista web crawl, a power-law graph with 1.4 billion vertices and 6 billion edges.

| System | Time |
|---|---|
| BIDMach/Kylix (64 nodes) | 2.5 secs |
| PowerGraph (64 nodes) | 7 secs |
| Hadoop (90 nodes) | 1000 secs |

BIDMach with Kylix is 3-7x faster than the next-fastest system for this problem, PowerGraph, on a 64-node cluster. This gap should grow on larger clusters thanks to the better asymptotic performance of butterfly networks.

[1] Huasha Zhao and John Canny, Butterfly Mixing: Accelerating Incremental-Update Algorithms on Clusters, Proc. 2013 SIAM International Conference on Data Mining (SDM 2013)

[2] H. Zhao and J. Canny, Kylix: A Sparse Allreduce for Commodity Clusters, to appear in Int. Conf. on Parallel Processing 2014.

(1) Allreduce can be implemented in the MapReduce Framework but requires two reduce stages. Our approach provides allreduce as a single-stage primitive.
