Performance comparison on matrix multiplication
This post records the results we obtained in our experiments.

## Experiments Environment

Spark 1.0.1 standalone cluster:
- 1 Master. CPU: Intel(R) 6-Core Xeon(R) X5660 2.80GHz, Memory: 32GB
- 16 Workers. CPU: Intel(R) Quad-Core Xeon(R) E5620 2.40GHz, Memory: 24GB

Hadoop 2.3.0
## Experiments Data

When two large-scale matrices are multiplied, the multiplication method follows the CARMA algorithm. Unlike the MPI implementation in that paper, however, an implementation on top of Spark cannot avoid the shuffle phase, which affects performance. Each element of the matrices is a random float between 0 and 5, and both matrices are dense. The table below describes the experiment data; a sketch of how such dense matrix files can be generated with Spark follows the table.
| Each matrix (rows by columns) | Total size of the two matrix files |
| --- | --- |
| 10000 by 10000 | 1.72GB |
| 15000 by 15000 | 3.78GB |
| 20000 by 20000 | 6.7GB |
| 25000 by 25000 | 10.48GB |
| 30000 by 30000 | 15.08GB |
| 35000 by 35000 | 20.54GB |
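As a minimal sketch, the Spark job below generates one of these dense matrices as text files of random floats in [0, 5). The output path and the per-line format (`rowIndex v1,v2,...,vN`) are assumptions for illustration only; adapt the format to whatever the DenseVecMatrix loader of your Marlin version expects.

```scala
import org.apache.spark.{SparkConf, SparkContext}

import scala.util.Random

object GenerateMatrixData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("generate-dense-matrix"))

    val rows = 10000    // row count of the smallest experiment matrix
    val cols = 10000    // column count
    val parts = 64      // number of output partitions (assumption: tune to your cluster)

    // One text line per row: "<rowIndex> v1,v2,...,vN" with random floats in [0, 5).
    // This on-disk format is an assumption; adjust it to Marlin's parser.
    val lines = sc.parallelize(0 until rows, parts).map { i =>
      val rnd = new Random(i)
      val values = Array.fill(cols)(rnd.nextFloat() * 5.0f)
      s"$i ${values.mkString(",")}"
    }
    lines.saveAsTextFile("hdfs:///user/test/matrixA")  // hypothetical output path

    sc.stop()
  }
}
```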
## Experimental Results

As we know, Spark uses Breeze as its linear algebra library, and Breeze uses netlib-java to load a native linear algebra library for its computations. The performance gap between pure Java and a native linear algebra library (e.g. BLAS) is significant. Here we compare three conditions: Spark using our Marlin to multiply the two matrices, Breeze multiplying the two matrices locally on a single worker, and plain Java multiplying the two matrices locally on a single worker. The former two conditions successfully load native BLAS, while the last one only uses the JVM. You can find the steps for loading a native linear algebra library here.
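As a hedged illustration of the single-node baseline, the sketch below first prints which BLAS implementation netlib-java actually loaded (a native class such as `NativeSystemBLAS`, or the pure-Java fallback `F2jBLAS`) and then times one local Breeze multiplication at the smallest experiment size:

```scala
import breeze.linalg.DenseMatrix
import com.github.fommil.netlib.BLAS

object SingleNodeBaseline {
  def main(args: Array[String]): Unit = {
    // Report which BLAS implementation netlib-java picked up: a native class
    // (e.g. NativeSystemBLAS) or the pure-Java fallback F2jBLAS.
    println(s"BLAS implementation: ${BLAS.getInstance().getClass.getName}")

    val n = 10000                        // matches the smallest experiment
    val a = DenseMatrix.rand(n, n)       // uniform random doubles
    val b = DenseMatrix.rand(n, n)

    val start = System.nanoTime()
    val c = a * b                        // Breeze dispatches to the BLAS reported above
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"multiplied $n%d x $n%d in $seconds%.1f s, c(0,0) = ${c(0, 0)}%.4f")
  }
}
```

If the printed class name ends with `F2jBLAS`, the native library was not picked up and Breeze falls back to pure Java.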
Below are the results:
| | each matrix size: 10000 by 10000 | each matrix size: 15000 by 15000 | each matrix size: 20000 by 20000 | each matrix size: 25000 by 25000 | each matrix size: 30000 by 30000 | each matrix size: 35000 by 35000 |
| --- | --- | --- | --- | --- | --- | --- |
| Marlin (with BLAS) | 33s | 72s | 145s | 320s | 610s | 980s |
| Single-Node Breeze (with BLAS) | 240s | 920s | 2220s | 4365s | NA (out of memory) | NA |
| Single-Node Java | 1560s | NA | NA | NA | NA | NA |
If the two input matrices are not both large, the multiplication can instead use the broadcast approach. Here we compare the performance of Marlin against ScaLAPACK (using the SUMMA algorithm), HAMA, and single-node R in the same environment. You can see the performance comparison on three different cases below; a sketch of the broadcast approach follows the table:
| Matrix dimensions | Marlin | SUMMA | HAMA | R |
| --- | --- | --- | --- | --- |
| 512000x1000x1000 | 5s | 10.6s | 1250s | 148s |
| 1024000x1000x1000 | 10s | 20.3s | 3000s | 297s |
| 1536000x1000x1000 | 12s | 29s | NA | 906s |
| 2048000x1000x1000 | 13s | 39s | NA | 3302s |
| 2560000x1000x1000 | 16s | 79s | NA | NA |
| 65536x128x65536 | 7s | 7s | NA | 1163s |
| 81920x128x81920 | 8s | 9.7s | NA | NA |
| 98304x128x98304 | 10s | 15s | NA | NA |
| 114688x128x114688 | 13s | 17s | NA | NA |
| 131072x128x131072 | 16s | 21.4s | NA | NA |
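The sketch below illustrates the idea behind the broadcast approach under stated assumptions: when B is small (here 1000 by 1000, as in the first group of cases), it is broadcast to every executor and each row of the row-distributed A is multiplied against it locally with Breeze, so no shuffle is needed. This is an illustration of the technique, not Marlin's internal implementation; the sizes and partition count are placeholders.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

import scala.util.Random

object BroadcastMultiplySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-multiply-sketch"))

    val m = 512000   // rows of A, as in the first case above
    val k = 1000     // columns of A == rows of B
    val n = 1000     // columns of B

    // A is distributed as (rowIndex, rowVector); B is small enough to live on the driver.
    val aRows = sc.parallelize(0 until m, 256).map { i =>
      (i.toLong, DenseVector.fill(k)(Random.nextDouble() * 5))
    }
    val bLocal = DenseMatrix.rand(k, n)

    // Ship B to every executor once; each row of A is then multiplied locally,
    // so no shuffle phase is required.
    val bBroadcast = sc.broadcast(bLocal)
    val cRows = aRows.mapValues(row => bBroadcast.value.t * row)  // one row of C, length n

    println(s"computed ${cRows.count()} result rows")
    sc.stop()
  }
}
```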
When you load a DenseVecMatrix from files, we strongly suggest setting spark.default.parallelism according to your environment, so that every executor gets 2 or 3 tasks or more, especially when the matrices being multiplied are not very large. A configuration sketch follows.
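For example, on the 16-worker cluster described above, a value of 48 would give each worker roughly 3 tasks. The value is an assumption and should be tuned to the workers and cores in your own cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// 16 workers * 3 tasks each = 48; this value is an assumption -- tune it to
// the number of workers and cores in your own cluster.
val conf = new SparkConf()
  .setAppName("marlin-multiply")
  .set("spark.default.parallelism", "48")
val sc = new SparkContext(conf)
```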
Note that when using Marlin, if Spark cannot load the native linear algebra library successfully, a local pure-Java implementation will be used for the computation instead.