
Performance comparison on matrix multiplication


This post records the results of our experiments.

## Experiments Environment

Spark 1.0.1 standalone cluster:

* 1 Master. CPU: Intel(R) 6-Core Xeon(R) X5660 2.80GHz, Memory: 32GB
* 16 Workers. CPU: Intel(R) Quad-Core Xeon(R) E5620 2.40GHz, Memory: 24GB

Hadoop 2.3.0

## Experiments Data

When two large-scale matrices are multiplied, the multiplication method follows the CARMA algorithm. Unlike the MPI implementation in that paper, however, an implementation on top of Spark cannot avoid the shuffle phase, which affects performance (see the shuffle-based sketch after the table below). Each element of the matrices is a random float between 0 and 5, and both matrices are dense. The table below summarizes the experiment data.

| Each matrix (rows by columns) | Total size of the two matrix files |
|-------------------------------|------------------------------------|
| 10000 by 10000                | 1.72GB                             |
| 15000 by 15000                | 3.78GB                             |
| 20000 by 20000                | 6.7GB                              |
| 25000 by 25000                | 10.48GB                            |
| 30000 by 30000                | 15.08GB                            |
| 35000 by 35000                | 20.54GB                            |
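
To make the shuffle point above concrete, here is a minimal sketch of a shuffle-based block multiplication on Spark RDDs with Breeze matrices. This is an illustration, not Marlin's actual implementation; the block layout, key scheme, and the `multiply` helper are assumptions.

```scala
// A minimal sketch (NOT Marlin's actual code) of shuffle-based block
// multiplication: C(i,j) = sum over k of A(i,k) * B(k,j).
import breeze.linalg.DenseMatrix
import org.apache.spark.SparkContext._   // pair-RDD operations (join, reduceByKey) on Spark 1.x
import org.apache.spark.rdd.RDD

object BlockMultiplySketch {
  // A block is keyed by its (blockRow, blockCol) coordinates.
  type Block = ((Int, Int), DenseMatrix[Double])

  def multiply(aBlocks: RDD[Block], bBlocks: RDD[Block]): RDD[Block] = {
    // Re-key A by its column-block index and B by its row-block index.
    val aByMid = aBlocks.map { case ((i, k), m) => (k, (i, m)) }
    val bByMid = bBlocks.map { case ((k, j), m) => (k, (j, m)) }

    aByMid.join(bByMid)                                                   // <-- the shuffle phase
      .map { case (_, ((i, aMat), (j, bMat))) => ((i, j), aMat * bMat) }  // local BLAS gemm via Breeze
      .reduceByKey(_ + _)                                                 // sum partial products per output block
  }
}
```

The `join` step is the shuffle phase mentioned above: every A block must be co-located with every matching B block before the local multiplications can run.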

## Experimental Results

Spark uses Breeze as its linear algebra library, and Breeze uses netlib-java to load a native linear algebra library (e.g. BLAS) for its computations. The performance gap between pure Java and a native linear algebra library is significant. Here we compare three configurations: Spark multiplying the two matrices with our Marlin, Breeze multiplying the two matrices locally on a single worker, and pure Java multiplying the two matrices locally on a single worker. The first two configurations successfully load native BLAS, while the last one runs only on the JVM. You can find the steps to load a native linear algebra library here.
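
If you are unsure whether the native library was actually picked up, a minimal check (assuming the netlib-java artifact used by Breeze is on the classpath, as it is for Spark 1.x) is to ask netlib-java which BLAS backend it instantiated:

```scala
// Minimal sketch: report which BLAS backend netlib-java loaded.
import com.github.fommil.netlib.BLAS

object CheckBlas {
  def main(args: Array[String]): Unit = {
    // Prints e.g. "NativeSystemBLAS" or "NativeRefBLAS" when a native library
    // was found, and "F2jBLAS" when it fell back to the pure-Java implementation.
    println(BLAS.getInstance().getClass.getSimpleName)
  }
}
```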

The results are shown below:

| Each matrix size | 10000 by 10000 | 15000 by 15000 | 20000 by 20000 | 25000 by 25000 | 30000 by 30000 | 35000 by 35000 |
|------------------|----------------|----------------|----------------|----------------|----------------|----------------|
| Marlin (with BLAS) | 33s | 72s | 145s | 320s | 610s | 980s |
| Single-Node Breeze (with BLAS) | 240s | 920s | 2220s | 4365s | NA (out of memory) | NA |
| Single-Node Java | 1560s | NA | NA | NA | NA | NA |

If the two input matrices are not both large, a broadcast approach can be used for the multiplication (a minimal sketch of this approach is shown after the table). Here we compare the performance of Marlin with ScaLAPACK (using the SUMMA algorithm), HAMA, and single-node R in the same development environment. You can see the performance comparison for the different cases below:

| Matrix dimensions | Marlin | SUMMA | HAMA | R |
|-------------------|--------|-------|------|---|
| 512000 x 1000 x 1000 | 5s | 10.6s | 1250s | 148s |
| 1024000 x 1000 x 1000 | 10s | 20.3s | 3000s | 297s |
| 1536000 x 1000 x 1000 | 12s | 29s | NA | 906s |
| 2048000 x 1000 x 1000 | 13s | 39s | NA | 3302s |
| 2560000 x 1000 x 1000 | 16s | 79s | NA | NA |
| 65536 x 128 x 65536 | 7s | 7s | NA | 1163s |
| 81920 x 128 x 81920 | 8s | 9.7s | NA | NA |
| 98304 x 128 x 98304 | 10s | 15s | NA | NA |
| 114688 x 128 x 114688 | 13s | 17s | NA | NA |
| 131072 x 128 x 131072 | 16s | 21.4s | NA | NA |
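
For this small-matrix case, a minimal sketch of the broadcast approach is given below. Again this is an illustration, not Marlin's actual code; the row-block layout and helper names are assumptions. The small matrix is shipped to every executor once and each row block of the large matrix is multiplied locally, so no shuffle is required.

```scala
// A minimal sketch (NOT Marlin's actual code) of the broadcast approach:
// broadcast the small matrix B and multiply each row block of A locally.
import breeze.linalg.DenseMatrix
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations (mapValues) on Spark 1.x
import org.apache.spark.rdd.RDD

object BroadcastMultiplySketch {
  def multiply(sc: SparkContext,
               aRowBlocks: RDD[(Int, DenseMatrix[Double])],  // (rowBlockIndex, block)
               smallB: DenseMatrix[Double]): RDD[(Int, DenseMatrix[Double])] = {
    val bBroadcast = sc.broadcast(smallB)
    // Each executor multiplies its local A blocks by the broadcast copy of B;
    // the result keeps the same row-block partitioning as A.
    aRowBlocks.mapValues(aBlock => aBlock * bBroadcast.value)
  }
}
```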


When you load a DenseVecMatrix from files, we strongly suggest setting spark.default.parallelism according to your environment so that every executor gets 2 to 3 tasks, especially when the matrices being multiplied are not very large. A configuration sketch is given below.
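
The sketch below illustrates that suggestion. The executor count and application name are assumptions matching the cluster described earlier, not requirements.

```scala
// A hedged example of setting spark.default.parallelism so that each executor
// gets roughly 2-3 tasks, as suggested above.
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    val numExecutors = 16       // one executor per worker in the cluster described above (assumption)
    val tasksPerExecutor = 3    // 2-3 tasks per executor, per the suggestion
    val conf = new SparkConf()
      .setAppName("MarlinMultiplyExample")
      .set("spark.default.parallelism", (numExecutors * tasksPerExecutor).toString)
    val sc = new SparkContext(conf)
    // ... load the DenseVecMatrix from files and multiply here ...
    sc.stop()
  }
}
```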

When using Marlin, if Spark cannot load the native linear algebra library successfully, the local pure-Java implementation will be used for the computation instead.