Benchmarks
We recently ran some fresh benchmarks for Spark v1.1 and GraphLab clusters, and included some updated numbers from other recently published benchmarks.
The algorithms and datasets are listed below.
- Logistic Regression for text classification (RCV1)
- Logistic Regression for click prediction (Criteo)
- K-means (MNIST 8M digit recognition)
- Matrix Factorization (Netflix dataset)
- Topic Modeling (NYTimes dataset)
- Random Forests (Year Prediction dataset)
- PageRank on social graphs (Twitter Followers)
- PageRank on a web graph (AltaVista crawl)
RCV1-v2 (Reuters news data, LYRL2004 distribution) benchmarks are for OAA (One-Against-All) classification, since RCV1-v2 has 103 independent topic labels. RCV1-v2 is a small dataset (0.5 GB).
BIDMach was run on a single Amazon g2.2xlarge instance, while Spark was run on a cluster of m3.2xlarge high-memory instances. The other systems were run on an 8-core Intel E5-2660 system. All of the systems here that support Intel MKL were compiled with MKL linked in.
Logistic Regression (sparse) on RCV1
| System | Nodes/Cores | nclasses | AUC | Time | Cost | Energy (kJ) |
|--------|-------------|----------|-----|------|------|-------------|
| Spark | 18/72 | 1 | 0.93 | 30s | $0.07 | 120 |
| BIDMach | 1 | 103 | 0.96 | 14s | $0.002 | 3 |
| VW | 1/8 | 103 | 0.96 | 130s | $0.02 | 30 |
| Scikit-Learn | 1/8 | 103 | 0.96 | 576s | $0.08 | 120 |
VW = Vowpal Wabbit
Accuracy (AUC, higher is better) is quoted only for topic 6, which is roughly balanced between positive and negative instances. The accuracy across all classes is somewhat higher, but tracks the topic-6 accuracy.
BIDMach on one GPU-accelerated node was faster than Spark on the 18-node cluster, while training 103x as many models. BIDMach was also faster when training only a single model, in that case about 4x faster than Spark on the cluster.
All the systems were run as OAA (multiclass) classifiers except Spark, which as of Spark 1.2 does not yet support multi-class classification. Spark's accuracy was lower than the other systems' after 3 passes over the dataset, perhaps because Spark's SGD implementation makes only a single SGD update per pass over the dataset, while the other systems make one update per minibatch.
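To make the update-frequency point concrete, here is a minimal NumPy sketch of OAA logistic regression trained with minibatch SGD. It is an illustration under simplified assumptions (dense data, fixed learning rate), not the code of any benchmarked system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_minibatch(X, y, lr=0.1, batch_size=1000, npasses=3):
    """Binary logistic regression with minibatch SGD: one weight update
    per minibatch, so many updates per pass over the data."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(npasses):
        for start in range(0, n, batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            grad = xb.T @ (sigmoid(xb @ w) - yb) / len(yb)
            w -= lr * grad
    return w

def train_oaa(X, Y, nclasses, **kw):
    """One-Against-All: one independent binary model per class.
    Y is an (n, nclasses) 0/1 matrix (RCV1-v2 topics are independent)."""
    return np.column_stack(
        [train_logreg_minibatch(X, Y[:, c], **kw) for c in range(nclasses)])
```

A whole-pass gradient method, by contrast, computes the gradient over all n examples before making its single update, which is why it needs many more passes to reach the same accuracy.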
The reference for RCV1-v2 is:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
Criteo released a medium-sized (12 GB) click-prediction dataset with a single binary target (click / no click) and a very sparse feature set. This is representative of many click prediction tasks in industry.
Logistic Regression (sparse) on Criteo
| System | Nodes/Cores | npasses | AUC | Time | Cost | Energy (kJ) |
|--------|-------------|---------|-----|------|------|-------------|
| Spark | 8/32 | 10 | 0.62 | 964s | $0.64 | 1500 |
| Spark | 32/128 | 10 | 0.62 | 400s | $1.00 | 2500 |
| BIDMach | 1 | 1 | 0.66 | 81s | $0.01 | 6 |
| BIDMach | 1 | 10 | 0.72 | 805s | $0.13 | 60 |
BIDMach was run as before on a single Amazon g2.2xlarge instance, while Spark 1.1 was run on a cluster of m3.2xlarge high-memory instances. Once again Spark's SGD did not perform well, and running more iterations did not improve it. Random guessing yields an AUC of 0.5, so BIDMach's score of 0.72 (a lift of 0.72 - 0.5 = 0.22 over random) represents about twice the lift of Spark's 0.62 (0.62 - 0.5 = 0.12). BIDMach was around 10x faster on a per-node basis. Single-target logistic regression is a worst case for GPU acceleration, but BIDMach's numbers were quite respectable on this problem.
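For readers unfamiliar with how such very sparse categorical features are typically handled, here is a sketch of the hashing trick, the approach used by systems like VW to map raw features into a fixed-width weight vector. The function names and parameters here are illustrative, not any benchmarked system's API:

```python
import numpy as np

def hash_features(raw_features, nbits=20):
    """Map raw categorical features, e.g. ['site=foo', 'ad=123'], to
    indices in a 2**nbits-dimensional sparse vector (the hashing trick)."""
    dim = 1 << nbits
    idx = sorted({hash(f) % dim for f in raw_features})
    return np.array(idx)

def predict(w, idx):
    """Logistic prediction over binary hashed features: sum the weights
    at the active indices and squash."""
    return 1.0 / (1.0 + np.exp(-w[idx].sum()))

def sgd_update(w, idx, y, lr=0.05):
    """One SGD step on a single example (click label y in {0, 1})."""
    g = predict(w, idx) - y
    w[idx] -= lr * g

# Usage: a 2**20-dimensional model, one online update per impression.
w = np.zeros(1 << 20)
sgd_update(w, hash_features(['site=example.com', 'ad=123']), y=1)
```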
The dataset is available here:
http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/
The MNIST8M dataset is a medium-sized (25 GB) collection of images of handwritten digits: 8 million 28x28 images labeled 0-9. K-means was run for 20 iterations.
K-means on MNIST8M
| System | Nodes/Cores | nclusters | SumSq | Time | Cost | Energy (kJ) |
|--------|-------------|-----------|-------|------|------|-------------|
| Spark | 32/128 | 256 | 1.5e13 | 180s | $0.45 | 1150 |
| BIDMach | 1 | 256 | 1.44e13 | 220s | $0.04 | 60 |
| Spark | 96/384 | 4096 | 1.05e13 | 1100s | $9.00 | 22000 |
| BIDMach | 1 | 4096 | 0.995e13 | 735s | $0.12 | 140 |
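The SumSq column is the K-means objective: the total squared distance from each point to its nearest centroid (lower is better). Here is a minimal sketch of one Lloyd iteration that also computes this metric, assuming a dense data matrix (an illustration, not either system's implementation):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then recompute centroids as cluster means. Returns the new centroids
    and the SumSq objective for the current assignment."""
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2
    d2 = (np.square(X).sum(1)[:, None]
          - 2 * X @ centroids.T
          + np.square(centroids).sum(1)[None, :])
    assign = d2.argmin(axis=1)
    sumsq = d2[np.arange(len(X)), assign].sum()
    new_centroids = np.array(
        [X[assign == k].mean(axis=0) if (assign == k).any() else centroids[k]
         for k in range(len(centroids))])
    return new_centroids, sumsq

# Usage: 20 iterations, 256 clusters, random points as initial centroids.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 784)).astype(np.float32)
centroids = X[rng.choice(len(X), 256, replace=False)]
for _ in range(20):
    centroids, sumsq = kmeans_step(X, centroids)
```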
The reference for the dataset is:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
The Netflix dataset is a small (about 4 GB) collection of movie ratings from 0.5 million users on 17,770 movies. Sparse matrix factorization was run in two variants: a batch version (also known as Alternating Least Squares) and an SGD implementation.
Matrix Factorization on Netflix
| System | Nodes/Cores | Dim | RMSE | Time | Cost | Energy (kJ) |
|--------|-------------|-----|------|------|------|-------------|
| GraphLab | 18/576 | 100 | - | 376s | $3.50 | 10,000 |
| Spark | 32/128 | 100 | 0.82 | 146s | $0.40 | 1000 |
| BIDMach | 1 | 100 | 0.83 | 90s | $0.015 | 20 |
| Spark | 32/128 | 200 | 0.82 | 544s | $1.45 | 3500 |
| BIDMach | 1 | 200 | 0.83 | 129s | $0.02 | 30 |
| BIDMach | 1 | 500 | 0.83 | 600s | $0.10 | 150 |
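As a sketch of the SGD variant (a generic factorization update, not BIDMach's implementation): factor the ratings matrix R ≈ U V^T and step both factors along the gradient of the squared error on each observed rating, where (rows, cols, vals) are the COO arrays of observed (user, movie, rating) triples:

```python
import numpy as np

def sgd_mf_pass(rows, cols, vals, U, V, lr=0.01, reg=0.05):
    """One SGD pass over the observed entries of a sparse ratings matrix,
    updating rank-dim factors U (users) and V (movies) in place."""
    for u, m, r in zip(rows, cols, vals):
        err = r - U[u] @ V[m]
        # Tuple assignment so both updates use the pre-step factors.
        U[u], V[m] = (U[u] + lr * (err * V[m] - reg * U[u]),
                      V[m] + lr * (err * U[u] - reg * V[m]))

def rmse(rows, cols, vals, U, V):
    """Root-mean-square error over the observed entries."""
    preds = np.einsum('ij,ij->i', U[rows], V[cols])
    return np.sqrt(np.mean((vals - preds) ** 2))
```

ALS instead alternates closed-form least-squares solves for U and V, which is the batch variant quoted in the table.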
Latent Dirichlet Allocation (sparse) on the NYTimes dataset
BIDMach includes several implementations of LDA: an online Variational Bayes (VB) [3], a batch Variational Bayes [1], and a "cooled" Gibbs sampler described in a forthcoming paper [6]. We compared against four systems, all custom LDA implementations (a minimal sketch of the baseline collapsed Gibbs sampler follows this list):
- David Blei's original C++ implementation of VB [1]
- Tom Griffiths and Mark Steyvers' implementation of Collapsed Gibbs Sampling in Matlab/C [2]
- Yan et al.'s implementation of Collapsed Gibbs Sampling with GPU acceleration [4]
- Ahmed et al.'s implementation of VB on a customized cluster system (Yahoo_LDA) [5]
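As promised above, here is a minimal sketch of the baseline collapsed Gibbs sampler of [2]. It is a readable single-machine illustration of the algorithm, not BIDMach's cooled sampler [6]:

```python
import numpy as np

def lda_gibbs(docs, ntopics, vocab_size, npasses=100, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA (the baseline algorithm of [2]).
    docs: list of documents, each a list of word ids. Returns the
    topic-word count matrix, from which topics can be read off."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), ntopics))    # doc-topic counts
    nkw = np.zeros((ntopics, vocab_size))   # topic-word counts
    nk = np.zeros(ntopics)                  # per-topic totals
    z = []                                  # topic assignment per token
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(ntopics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(npasses):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # remove token's current topic
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional p(z = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * vocab_size)
                k = rng.choice(ntopics, p=p / p.sum())
                z[d][i] = k                 # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw

# Usage: 3 tiny documents over a 5-word vocabulary, 2 topics.
topics = lda_gibbs([[0, 1, 2], [1, 2, 3], [3, 4, 4]], ntopics=2, vocab_size=5)
```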
| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
|--------|------|------------|-----------------|--------|----------|
| BIDMach online VB (GTX 680 GPU) | 40 secs | 2 | 40 MB/s | 25 | 250 |
| BIDMach Cooled GS (GTX 680 GPU) | 90 secs | 3 | 20 MB/s | 30 | 300 |
| BIDMach batch VB (GTX 680 GPU) | 400 secs | 20 | 50 MB/s | 25 | 250 |
| Yan GPU Collapsed GS | 5400 secs | 1000 | 100 MB/s | 0.4 (est.) | 4 |
| Blei batch VB | 252,000 secs | 20 | 0.1 MB/s | 0.05 (est.) | 0.5 |
| Griffiths Collapsed GS | 225,000 secs | 1000 | 2 MB/s | 0.008 (est.) | 0.08 |
To compare with the Yahoo_LDA cluster implementation [5], which used proprietary news articles, we constructed a repeating stream of NYTimes articles following [5]. There were two problem instances. The first was 200 million articles with 256 topics:
| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
|--------|------|------------|-----------------|--------|----------|
| BIDMach Cooled GS (1 node) | 30,000 secs | 2 | 10 MB/s | 30 | 300 |
| Yahoo_LDA (100 nodes) | 120,000 secs | 1000 | 1300 MB/s | 8 (est.) | 0.8 |
The second was 2 billion articles with 256 topics:
| System | Time | Iterations | Data Throughput | Gflops | Mflops/W |
|--------|------|------------|-----------------|--------|----------|
| BIDMach Cooled GS (1 node) | 300,000 secs | 2 | 10 MB/s | 30 | 300 |
| Yahoo_LDA (1000 nodes) | 240,000 secs | 1000 | 7000 MB/s | 40 (est.) | 0.4 |
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[2] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1):5228–5235, 2004.
[3] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, volume 2, page 5, 2010.
[4] F. Yan, N. Xu, and Y. Qi. Parallel inference for Latent Dirichlet Allocation on graphics processing units. In NIPS, volume 9, pages 2134–2142, 2009.
[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM), pages 123–132. ACM, 2012.
[6] H. Zhao and J. Canny. Cooled Gibbs Parameter Estimation. Under review, 2014.
The benchmarks so far have been on problems that involve "small" models in the megabyte range. Many other problems are naturally "graph" problems that don't decompose into minibatches. These problems typically decompose into a Map operation and an Allreduce (a special kind of reduce where the reduced data is redistributed to all nodes^(1)). Most data frameworks (Hadoop, Spark, GraphLab) use a reduce operation which isn't scalable [2]: in these systems, messages are exchanged all-to-all in a network of N nodes, and message size shrinks as O(1/N) (for a fixed amount of data per node) or O(1/N^2) (for a fixed total dataset size). Eventually messages become too small to be sent efficiently. Butterfly Allreduce primitives overcome this problem [1], [2], and also provide error-tolerance and speed benefits.
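To make the scaling argument concrete, here is a simulated recursive-doubling (butterfly) allreduce, assuming the node count is a power of two: node i exchanges with node i XOR 2^s in round s, so after log2(N) rounds every node holds the global sum, and the per-round message size stays constant instead of shrinking with N. This simulates the communication pattern only; it is not Kylix's sparse, heterogeneous-degree design [2]:

```python
import numpy as np

def butterfly_allreduce(node_data):
    """Simulated recursive-doubling (butterfly) allreduce over N nodes
    (N a power of two). After log2(N) rounds every node holds the sum
    of all inputs; each round exchanges full-sized partial sums, so
    message size does not shrink as the cluster grows."""
    data = [v.copy() for v in node_data]
    n = len(data)
    stride = 1
    while stride < n:
        # Round: node i combines its partial sum with node (i XOR stride)'s.
        data = [data[i] + data[i ^ stride] for i in range(n)]
        stride *= 2
    return data

# Example: 8 simulated nodes, each holding a local vector; all end with the sum.
parts = [np.full(4, i, dtype=float) for i in range(8)]
assert all(np.array_equal(v, np.full(4, 28.0)) for v in butterfly_allreduce(parts))
```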
The table below shows the performance of BIDMach running on a cluster of 64 nodes, for one iteration of the PageRank algorithm, using the "Kylix" allreduce primitive [2]. The first benchmark is for the Twitter Followers graph, a Power-law graph with 40 million vertices and 1.4 billion edges.
| System | Time |
|--------|------|
| BIDMach/Kylix (64 nodes) | 0.5 secs |
| PowerGraph (64 nodes) | 3.6 secs |
| Hadoop (90 nodes) | 250 secs |
The second benchmark is for the Yahoo Altavista Web crawl, a Power-law graph with 1.4 billion vertices and 6 billion edges.
| System | Time |
|--------|------|
| BIDMach/Kylix (64 nodes) | 2.5 secs |
| PowerGraph (64 nodes) | 7 secs |
| Hadoop (90 nodes) | 1000 secs |
BIDMach with Kylix is 3-7x faster than the next-fastest system for this problem, PowerGraph, on a 64-node cluster. This gap should grow on larger clusters, thanks to the better asymptotic performance of butterfly networks.
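For reference, one PageRank iteration is just a sparse matrix-vector multiply followed by combining partial results across graph partitions, and that combine is the allreduce step Kylix accelerates. A minimal single-machine sketch using SciPy (illustrative, not the Kylix implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_iter(A, r, d=0.85):
    """One PageRank power iteration. A is the column-stochastic link
    matrix (A[i, j] = 1/outdeg(j) if page j links to page i), stored
    sparse. On a cluster, the A @ r products of the graph partitions
    are combined with an allreduce each iteration."""
    n = len(r)
    return d * (A @ r) + (1 - d) / n

# Tiny 3-page example: 0 -> 1, 1 -> 2, 2 -> 0 (dangling nodes ignored).
A = csr_matrix(np.array([[0, 0, 1.0], [1.0, 0, 0], [0, 1.0, 0]]))
r = np.full(3, 1 / 3)
for _ in range(20):
    r = pagerank_iter(A, r)
```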
[1] H. Zhao and J. Canny. Butterfly Mixing: Accelerating Incremental-Update Algorithms on Clusters. In Proc. 2013 SIAM International Conference on Data Mining (SDM 2013).
[2] H. Zhao and J. Canny. Kylix: A Sparse Allreduce for Commodity Clusters. To appear in Int. Conf. on Parallel Processing (ICPP), 2014.
(1) Allreduce can be implemented in the MapReduce framework, but requires two reduce stages. Our approach provides allreduce as a single-stage primitive.