
Performance comparison on matrix multiplication


This post records the results of our experiments.

## Experiments Environment

Spark 1.0.1 standalone cluster:

* 1 Master. CPU: Intel(R) 6-Core Xeon(R) X5660 2.80GHz, Memory: 32GB
* 16 Workers. CPU: Intel(R) Quad-Core Xeon(R) E5620 2.40GHz, Memory: 24GB

Hadoop 2.3.0

## Experiments Data

When two large-scale matrices are multiplied, the multiplication method follows the CARMA algorithm. Unlike the MPI implementation in that paper, however, an implementation on top of Spark cannot avoid the shuffle phase, which affects performance (see the shuffle-based sketch after the table below). Each element of the matrices is a random float between 0 and 5, and both matrices are dense. The table below summarizes the experiment data.

| Each matrix (rows by columns) | Total size of the two matrix files |
|-------------------------------|------------------------------------|
| 10000 by 10000                | 1.72GB                             |
| 15000 by 15000                | 3.78GB                             |
| 20000 by 20000                | 6.7GB                              |
| 25000 by 25000                | 10.48GB                            |
| 30000 by 30000                | 15.08GB                            |
| 35000 by 35000                | 20.54GB                            |
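
To make the shuffle point above concrete, here is a minimal sketch of a shuffle-based block multiplication on Spark RDDs with Breeze matrices. This is an illustration, not Marlin's actual implementation; the block layout, key scheme, and the `multiply` helper are assumptions.

```scala
// A minimal sketch (NOT Marlin's actual code) of shuffle-based block
// multiplication: C(i,j) = sum over k of A(i,k) * B(k,j).
import breeze.linalg.DenseMatrix
import org.apache.spark.SparkContext._   // pair-RDD operations (join, reduceByKey) on Spark 1.x
import org.apache.spark.rdd.RDD

object BlockMultiplySketch {
  // A block is keyed by its (blockRow, blockCol) coordinates.
  type Block = ((Int, Int), DenseMatrix[Double])

  def multiply(aBlocks: RDD[Block], bBlocks: RDD[Block]): RDD[Block] = {
    // Re-key A by its column-block index and B by its row-block index.
    val aByMid = aBlocks.map { case ((i, k), m) => (k, (i, m)) }
    val bByMid = bBlocks.map { case ((k, j), m) => (k, (j, m)) }

    aByMid.join(bByMid)                                                   // <-- the shuffle phase
      .map { case (_, ((i, aMat), (j, bMat))) => ((i, j), aMat * bMat) }  // local BLAS gemm via Breeze
      .reduceByKey(_ + _)                                                 // sum partial products per output block
  }
}
```

The `join` step is the shuffle phase mentioned above: every A block must be co-located with every matching B block before the local multiplications can run.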

## Experimental Results

Spark uses Breeze as its linear algebra library, and Breeze uses netlib-java to load a native linear algebra library (e.g. BLAS) for its computations. The performance gap between pure Java and a native linear algebra library is significant. Here we compare three configurations: Spark multiplying the two matrices with our Marlin, Breeze multiplying the two matrices locally on a single worker, and pure Java multiplying the two matrices locally on a single worker. The first two configurations successfully load native BLAS, while the last one runs only on the JVM. You can find the steps to load a native linear algebra library here.
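
If you are unsure whether the native library was actually picked up, a minimal check (assuming the netlib-java artifact used by Breeze is on the classpath, as it is for Spark 1.x) is to ask netlib-java which BLAS backend it instantiated:

```scala
// Minimal sketch: report which BLAS backend netlib-java loaded.
import com.github.fommil.netlib.BLAS

object CheckBlas {
  def main(args: Array[String]): Unit = {
    // Prints e.g. "NativeSystemBLAS" or "NativeRefBLAS" when a native library
    // was found, and "F2jBLAS" when it fell back to the pure-Java implementation.
    println(BLAS.getInstance().getClass.getSimpleName)
  }
}
```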

The results are shown below:

| Each matrix size | 10000 by 10000 | 15000 by 15000 | 20000 by 20000 | 25000 by 25000 | 30000 by 30000 | 35000 by 35000 |
|------------------|----------------|----------------|----------------|----------------|----------------|----------------|
| Marlin (with BLAS) | 33s | 72s | 145s | 320s | 610s | 980s |
| Single-Node Breeze (with BLAS) | 240s | 920s | 2220s | 4365s | NA (out of memory) | NA |
| Single-Node Java | 1560s | NA | NA | NA | NA | NA |

If the two input matrices are not both large, a broadcast approach can be used for the multiplication (a minimal sketch of this approach is shown after the table). Here we compare the performance of Marlin with ScaLAPACK (using the SUMMA algorithm), HAMA, and single-node R in the same development environment. You can see the performance comparison for the different cases below:

| Matrix dimensions | Marlin | SUMMA | HAMA | R |
|-------------------|--------|-------|------|---|
| 512000 x 1000 x 1000 | 5s | 10.6s | 1250s | 148s |
| 1024000 x 1000 x 1000 | 10s | 20.3s | 3000s | 297s |
| 1536000 x 1000 x 1000 | 12s | 29s | NA | 906s |
| 2048000 x 1000 x 1000 | 13s | 39s | NA | 3302s |
| 2560000 x 1000 x 1000 | 16s | 79s | NA | NA |
| 65536 x 128 x 65536 | 7s | 7s | NA | 1163s |
| 81920 x 128 x 81920 | 8s | 9.7s | NA | NA |
| 98304 x 128 x 98304 | 10s | 15s | NA | NA |
| 114688 x 128 x 114688 | 13s | 17s | NA | NA |
| 131072 x 128 x 131072 | 16s | 21.4s | NA | NA |
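
For this small-matrix case, a minimal sketch of the broadcast approach is given below. Again this is an illustration, not Marlin's actual code; the row-block layout and helper names are assumptions. The small matrix is shipped to every executor once and each row block of the large matrix is multiplied locally, so no shuffle is required.

```scala
// A minimal sketch (NOT Marlin's actual code) of the broadcast approach:
// broadcast the small matrix B and multiply each row block of A locally.
import breeze.linalg.DenseMatrix
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations (mapValues) on Spark 1.x
import org.apache.spark.rdd.RDD

object BroadcastMultiplySketch {
  def multiply(sc: SparkContext,
               aRowBlocks: RDD[(Int, DenseMatrix[Double])],  // (rowBlockIndex, block)
               smallB: DenseMatrix[Double]): RDD[(Int, DenseMatrix[Double])] = {
    val bBroadcast = sc.broadcast(smallB)
    // Each executor multiplies its local A blocks by the broadcast copy of B;
    // the result keeps the same row-block partitioning as A.
    aRowBlocks.mapValues(aBlock => aBlock * bBroadcast.value)
  }
}
```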


When you load a DenseVecMatrix from files, we strongly suggest setting spark.default.parallelism according to your environment so that every executor gets 2 to 3 tasks, especially when the matrices being multiplied are not very large. A configuration sketch is given below.
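
The sketch below illustrates that suggestion. The executor count and application name are assumptions matching the cluster described earlier, not requirements.

```scala
// A hedged example of setting spark.default.parallelism so that each executor
// gets roughly 2-3 tasks, as suggested above.
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    val numExecutors = 16       // one executor per worker in the cluster described above (assumption)
    val tasksPerExecutor = 3    // 2-3 tasks per executor, per the suggestion
    val conf = new SparkConf()
      .setAppName("MarlinMultiplyExample")
      .set("spark.default.parallelism", (numExecutors * tasksPerExecutor).toString)
    val sc = new SparkContext(conf)
    // ... load the DenseVecMatrix from files and multiply here ...
    sc.stop()
  }
}
```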

When using Marlin, if Spark cannot load the native linear algebra library successfully, the local pure-Java implementation will be used for the computation instead.