[SPARK-1390] Refactoring of matrices backed by RDDs #296
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13666/
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13690/
Merged build triggered.
Merged build started.
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13691/
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13692/
add tests for matrices
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13914/
The failed test is from Bagel. I'll re-run Jenkins.
Jenkins, retest this please.
Build triggered.
Build started.
Build finished. All automated tests passed.
All automated tests passed.
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13925/
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
This is to refactor the interfaces for matrices backed by RDDs. It would be better to have a clear separation between local matrices and those backed by RDDs. Right now, we have:

1. `org.apache.spark.mllib.linalg.SparseMatrix`, which is a wrapper over an RDD of matrix entries, i.e., coordinate list format.
2. `org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix`, which is a wrapper over `RDD[Array[Double]]`, i.e., row-oriented format.

We will see a naming collision when we introduce a local `SparseMatrix`, and the name `TallSkinnyDenseMatrix` is no longer exact if we switch from `RDD[Array[Double]]` to `RDD[Vector]`. It would be better to have "RDD" in the class name to suggest that operations may trigger jobs.

The proposed names are (all under `org.apache.spark.mllib.linalg.rdd`):

1. `RDDMatrix`: trait for matrices backed by one or more RDDs
2. `CoordinateRDDMatrix`: wrapper of `RDD[(Long, Long, Double)]`
3. `RowRDDMatrix`: wrapper of `RDD[Vector]` whose rows do not have special ordering
4. `IndexedRowRDDMatrix`: wrapper of `RDD[(Long, Vector)]` whose rows are associated with indices

The current code also introduces local matrices.

Author: Xiangrui Meng <[email protected]>

Closes apache#296 from mengxr/mat and squashes the following commits:

24d8294 [Xiangrui Meng] fix for groupBy returning Iterable
bfc2b26 [Xiangrui Meng] merge master
8e4f1f5 [Xiangrui Meng] Merge branch 'master' into mat
0135193 [Xiangrui Meng] address Reza's comments
03cd7e1 [Xiangrui Meng] add pca/gram to IndexedRowMatrix; add toBreeze to DistributedMatrix for test; simplify tests
b177ff1 [Xiangrui Meng] address Matei's comments
be119fe [Xiangrui Meng] rename m/n to numRows/numCols for local matrix; add tests for matrices
b881506 [Xiangrui Meng] rename SparkPCA/SVD to TallSkinnyPCA/SVD
e7d0d4a [Xiangrui Meng] move IndexedRDDMatrixRow to IndexedRowRDDMatrix
0d1491c [Xiangrui Meng] fix test errors
a85262a [Xiangrui Meng] rename RDDMatrixRow to IndexedRDDMatrixRow
b8b6ac3 [Xiangrui Meng] Remove old code
4cf679c [Xiangrui Meng] port pca to RowRDDMatrix, and add multiply and covariance
7836e2f [Xiangrui Meng] initial refactoring of matrices backed by RDDs
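To make the proposed separation concrete, here is a minimal Scala sketch of the shapes described above. It is an illustrative assumption rather than the code merged in this PR: the trait members and the way `numRows`/`numCols` are derived are guesses for the sketch.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Common trait for matrices backed by one or more RDDs (hypothetical sketch).
trait RDDMatrix extends Serializable {
  /** Number of rows; may trigger a job if not known at construction time. */
  def numRows(): Long
  /** Number of columns; may trigger a job if not known at construction time. */
  def numCols(): Long
}

/** Coordinate-list format: one entry per (row index, column index, value). */
class CoordinateRDDMatrix(val entries: RDD[(Long, Long, Double)]) extends RDDMatrix {
  override def numRows(): Long = entries.map(_._1).max() + 1L
  override def numCols(): Long = entries.map(_._2).max() + 1L
}

/** Row-oriented format whose rows have no special ordering. */
class RowRDDMatrix(val rows: RDD[Vector]) extends RDDMatrix {
  override def numRows(): Long = rows.count()
  override def numCols(): Long = rows.first().size.toLong
}

/** Row-oriented format where each row carries an explicit index. */
class IndexedRowRDDMatrix(val rows: RDD[(Long, Vector)]) extends RDDMatrix {
  override def numRows(): Long = rows.map(_._1).max() + 1L
  override def numCols(): Long = rows.first()._2.size.toLong
}
```

Keeping "RDD" in the class names, as the description argues, signals to callers that methods such as `numRows()` may launch Spark jobs rather than return a cached local value.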
…is reused

## What changes were proposed in this pull request?

With this change, we can easily identify the plan difference when a subquery is reused.

When the reuse is enabled, the plan looks like:

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253]
   :  :- Subquery subquery240
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- ReusedSubquery Subquery subquery240
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

When the reuse is disabled, the plan looks like:

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299]
   :  :- Subquery subquery286
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- Subquery subquery287
   :     +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L])
   :              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :                 +- Scan[obj#12]
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

## How was this patch tested?

Modified the existing test.

Closes #24258 from gatorsmile/followupSPARK-27279.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
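For context, the plans above come from a query that references the same scalar subquery twice. The following is a hedged sketch of such a query, not the exact test modified by this patch; the view name, table contents, and the use of the internal `spark.sql.execution.reuseSubquery` toggle are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SubqueryReuseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("subquery-reuse-demo")
      .getOrCreate()

    // A small table with a "key" column, mirroring the avg(key) scalar subquery in the plans above.
    spark.range(0, 100).toDF("key").createOrReplaceTempView("t")

    val query = "SELECT (SELECT avg(key) FROM t) + (SELECT avg(key) FROM t)"

    // With reuse enabled, the second scalar subquery should appear as a ReusedSubquery node.
    spark.conf.set("spark.sql.execution.reuseSubquery", "true")
    spark.sql(query).explain()

    // With reuse disabled, both subqueries are planned and executed separately.
    spark.conf.set("spark.sql.execution.reuseSubquery", "false")
    spark.sql(query).explain()

    spark.stop()
  }
}
```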