[SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc #1903

larryxiao · 2014-08-12T08:41:40Z

create verticesDeduplicate with reduceByKey, using mergeFunc
then proceed with verticesDedup

This is not tested yet, and I want to add a test for VertexRDD.apply.
Because it needs edges, should I place it in VertexRDDSuite or somewhere else?

AmplabJenkins · 2014-08-12T08:43:10Z

Can one of the admins verify this patch?

ankurdave · 2014-08-16T01:39:04Z

This isn't quite the right approach, since the call to reduceByKey will result in two rounds of communication and hash aggregations (in reduceByKey and copartitionWithVertices) when only one is necessary. It would be better to add a ShippableVertexPartition constructor that takes a mergeFunc, then just pass the mergeFunc from here into that constructor.

Also, the capitalization in VD1 and VD2 suggests that they are type parameters when they are actually function parameters -- they should probably just be a and b.

I can make these changes this weekend if you like.
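
A minimal sketch of the single-pass idea suggested above, written in plain Scala without Spark. `buildPartition` and the Map-based representation are hypothetical stand-ins for illustration, not the actual ShippableVertexPartition API; the point is only that duplicates can be merged while the partition is being built, so no separate deduplication pass is needed.

```scala
object SinglePassMerge {
  type VertexId = Long

  // Hypothetical stand-in for a partition constructor that accepts a mergeFunc.
  // Duplicates are combined while the map is built, in a single pass over iter.
  def buildPartition[VD](
      iter: Iterator[(VertexId, VD)],
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] = {
    iter.foldLeft(Map.empty[VertexId, VD]) { case (acc, (vid, attr)) =>
      val merged = acc.get(vid) match {
        case Some(existing) => mergeFunc(existing, attr) // duplicate ID: merge
        case None => attr                                // first occurrence
      }
      acc.updated(vid, merged)
    }
  }
}
```

For example, calling `SinglePassMerge.buildPartition` on the pairs `(0L, 1), (0L, 2), (1L, 3)` with `(a, b) => a + b` combines the two entries for vertex 0 into one.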

larryxiao · 2014-08-16T03:13:17Z

Thanks
I can do it

larryxiao · 2014-08-19T02:51:43Z

Is it OK now?

About testing: how do I test it? I think it should be added in GraphSuite. Is it necessary?

Thank you!

ankurdave · 2014-08-19T04:03:22Z

This looks good! A test would be good too. VertexRDDSuite seems like the right place, since nothing else actually calls this variant of VertexRDD.apply. It should be OK to create an empty EdgeRDD in VertexRDDSuite for testing purposes.

Here's a simple test:

val verts = sc.parallelize(List((0L, 1), (0L, 2), (1L, 3)))
val edges = EdgeRDD.fromEdges(sc.parallelize(List.empty[Edge[Int]]))
val rdd = VertexRDD(verts, edges, 0, (a: Int, b: Int) => a + b)
assert(rdd.collect.toSet == Set((0L, 3), (1L, 3)))

larryxiao · 2014-08-19T04:52:25Z

Thank you Ankur!

I'll add the test.

ankurdave · 2014-09-03T01:38:58Z

ok to test

SparkQA · 2014-09-03T01:44:21Z

QA tests have started for PR 1903 at commit e4ca697.

This patch merges cleanly.

SparkQA · 2014-09-03T01:45:22Z

QA tests have finished for PR 1903 at commit e4ca697.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

ankurdave · 2014-09-03T01:56:29Z

graphx/src/main/scala/org/apache/spark/graphx/impl/ShippableVertexPartition.scala

+   * and merging duplicate vertex attribute with mergeFunc.
+   */
+  def apply[VD: ClassTag](
+    iter: Iterator[(VertexId, VD)], routingTable: RoutingTablePartition, defaultVal: VD, mergeFunc: (VD, VD) => VD)


Looks like this line is too long - it would be great if you could wrap it. Also, I think the Spark style is for parameter lists to be indented 4 spaces instead of 2.

SparkQA · 2014-09-03T02:14:13Z

QA tests have started for PR 1903 at commit 1c70366.

This patch merges cleanly.

SparkQA · 2014-09-03T03:11:56Z

QA tests have finished for PR 1903 at commit 1c70366.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-05T23:43:25Z

Can one of the admins verify this patch?

larryxiao · 2014-09-16T09:01:07Z

As described in commit message:

a copy of vertices with defaultVal is created before, and it's b in
(a, b) => b

see in VertexPartition.scala
val fullIter = iter ++ routingTable.iterator.map(vid => (vid, defaultVal))

So there's a hidden rule that the default mergeFunc should be (a, b) => a.
Should I write a comment to let users know about this?
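
To make the ordering concrete, here is a small plain-Scala sketch of the concatenation approach quoted above. The routing table is modelled as a plain `Seq` of vertex IDs, and `mergeAll` and `initByConcat` are hypothetical names used only for illustration. Because the default entries are appended after the real ones, a default always arrives as `b`, so `(a, b) => a` keeps the real attribute while `(a, b) => b` replaces it with the default.

```scala
object ConcatDefaults {
  type VertexId = Long

  // Folds the pairs in iteration order. When an ID repeats, mergeFunc receives
  // the value seen earlier as `a` and the later one as `b`.
  def mergeAll[VD](
      iter: Iterator[(VertexId, VD)],
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] =
    iter.foldLeft(Map.empty[VertexId, VD]) { case (acc, (vid, attr)) =>
      acc.updated(vid, acc.get(vid).map(a => mergeFunc(a, attr)).getOrElse(attr))
    }

  // Mirrors the quoted line: real vertices first, then one default entry for
  // every vertex ID in the routing table (assumed here to be a plain Seq).
  def initByConcat[VD](
      iter: Iterator[(VertexId, VD)],
      routingIds: Seq[VertexId],
      defaultVal: VD,
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] = {
    val fullIter = iter ++ routingIds.iterator.map(vid => (vid, defaultVal))
    mergeAll(fullIter, mergeFunc)
  }
}
```

With one real vertex `(0L, 7)`, routing IDs `0` and `1`, and a default of `-1`, the function `(a, b) => a` leaves vertex 0 at 7, whereas `(a, b) => b` overwrites it with -1.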

ankurdave · 2014-09-16T09:31:01Z

ok to test

ankurdave · 2014-09-16T09:31:39Z

Yeah, a note about that default would be great.

SparkQA · 2014-09-16T09:34:16Z

QA tests have started for PR 1903 at commit dfdb3c9.

This patch merges cleanly.

SparkQA · 2014-09-16T10:40:54Z

QA tests have finished for PR 1903 at commit dfdb3c9.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

ankurdave · 2014-09-16T10:45:14Z

This looks good! I'll merge it pending the doc update.

SparkQA · 2014-09-16T23:59:26Z

QA tests have started for PR 1903 at commit 614059f.

This patch merges cleanly.

SparkQA · 2014-09-17T01:07:59Z

QA tests have finished for PR 1903 at commit 614059f.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

ankurdave · 2014-09-17T18:21:44Z

Your doc update made me realize that now that we're taking a mergeFunc in ShippableVertexPartition.initFrom, it's not ideal to use the iterator concatenation approach for setting the default values anymore, because the mergeFunc will get run on the default value, which might surprise users. I submitted a PR (larryxiao#1) to avoid this by doing the merge first, then populating the default values.

ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
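
The difference between the two orderings can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the code from the linked PR: the routing table is modelled as a `Seq` of vertex IDs, and `initByConcat` and `initMergeFirst` are hypothetical names. With a merge function such as `(a, b) => a + b`, the concatenation approach folds the default into every real vertex, while merging first and filling defaults afterwards does not.

```scala
object MergeThenDefault {
  type VertexId = Long

  // Merges duplicate IDs in iteration order using mergeFunc.
  private def mergeAll[VD](
      iter: Iterator[(VertexId, VD)],
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] =
    iter.foldLeft(Map.empty[VertexId, VD]) { case (acc, (vid, attr)) =>
      acc.updated(vid, acc.get(vid).map(a => mergeFunc(a, attr)).getOrElse(attr))
    }

  // Concatenation approach: defaults are appended, so mergeFunc also runs on them.
  def initByConcat[VD](
      iter: Iterator[(VertexId, VD)],
      routingIds: Seq[VertexId],
      defaultVal: VD,
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] =
    mergeAll(iter ++ routingIds.iterator.map(vid => (vid, defaultVal)), mergeFunc)

  // Merge-first approach: merge the real vertices, then fill in the default
  // only for IDs that are still missing. mergeFunc never sees defaultVal.
  def initMergeFirst[VD](
      iter: Iterator[(VertexId, VD)],
      routingIds: Seq[VertexId],
      defaultVal: VD,
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] = {
    val merged = mergeAll(iter, mergeFunc)
    routingIds.foldLeft(merged) { (acc, vid) =>
      if (acc.contains(vid)) acc else acc.updated(vid, defaultVal)
    }
  }
}
```

Using a nonzero default makes the surprise visible: with vertices `(0L, 1), (0L, 2), (1L, 3)`, routing IDs 0 to 2, a default of 10 and addition as the merge function, the concatenation approach adds 10 to vertices 0 and 1, while the merge-first approach leaves both at 3 and assigns 10 only to the missing vertex 2.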

larryxiao · 2014-09-18T01:41:28Z

Thanks Ankur!
I learned something :)

ankurdave · 2014-09-18T07:35:56Z

Jenkins, test this please.

SparkQA · 2014-09-18T07:39:17Z

QA tests have started for PR 1903 at commit 625aa9d.

This patch merges cleanly.

SparkQA · 2014-09-18T08:30:21Z

QA tests have finished for PR 1903 at commit 625aa9d.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

ankurdave · 2014-09-18T18:33:14Z

Unrelated failure in Streaming. Jenkins, retest this please.

SparkQA · 2014-09-18T18:39:20Z

QA tests have started for PR 1903 at commit 625aa9d.

This patch merges cleanly.

SparkQA · 2014-09-18T19:44:47Z

QA tests have finished for PR 1903 at commit 625aa9d.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

ankurdave · 2014-09-19T06:38:55Z

Thanks! Merged into master and branch-1.1.

VertexRDD.apply had a bug where it ignored the merge function for duplicate vertices and instead used whichever vertex attribute occurred first. This commit fixes the bug by passing the merge function through to ShippableVertexPartition.apply, which merges any duplicates using the merge function and then fills in missing vertices using the specified default vertex attribute. This commit also adds a unit test for VertexRDD.apply.

Author: Larry Xiao
Author: Blie Arkansol
Author: Ankur Dave

Closes #1903 from larryxiao/2062 and squashes the following commits:

625aa9d [Blie Arkansol] Merge pull request #1 from ankurdave/SPARK-2062
476770b [Ankur Dave] ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
614059f [Larry Xiao] doc update: note about the default null value vertices construction
dfdb3c9 [Larry Xiao] minor fix
1c70366 [Larry Xiao] scalastyle check: wrap line, parameter list indent 4 spaces
e4ca697 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
6a35ea8 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
4fbc29c [Blie Arkansol] undo unnecessary change
efae765 [Larry Xiao] fix mistakes: should be able to call with or without mergeFunc
b2422f9 [Larry Xiao] Merge branch '2062' of github.com:larryxiao/spark into 2062
52dc7f7 [Larry Xiao] pass mergeFunc to VertexPartitionBase, where merge is handled
581e9ee [Larry Xiao] TODO: VertexRDDSuite
20d80a3 [Larry Xiao] [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc

(cherry picked from commit 3bbbdd8)
Signed-off-by: Ankur Dave

larryxiao added 2 commits August 12, 2014 16:35

[SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc

20d80a3

create verticesDeduplicate with reduceByKey, using mergeFunc then proceed with verticesDedup

TODO: VertexRDDSuite

581e9ee

larryxiao added 3 commits August 19, 2014 09:23

pass mergeFunc to VertexPartitionBase, where merge is handled

52dc7f7

Merge branch '2062' of github.com:larryxiao/spark into 2062

b2422f9

fix mistakes: should be able to call with or without mergeFunc

efae765

undo unnecessary change

4fbc29c

[TEST] VertexRDD.apply mergeFunc

6a35ea8

larryxiao force-pushed the 2062 branch from 2d77956 to 78504b7 Compare August 25, 2014 02:37

[TEST] VertexRDD.apply mergeFunc

e4ca697

larryxiao force-pushed the 2062 branch from 78504b7 to e4ca697 Compare August 25, 2014 02:40

ankurdave reviewed Sep 3, 2014

scalastyle check: wrap line, parameter list indent 4 spaces

1c70366

minor fix

dfdb3c9

a copy of vertices with defaultVal is created before, and it's b in
(a, b) => b

see in VertexPartition.scala
val fullIter = iter ++ routingTable.iterator.map(vid => (vid, defaultVal))

larryxiao force-pushed the 2062 branch from 4ec15b4 to dfdb3c9 Compare September 16, 2014 08:54

doc update: note about the default null value vertices construction

614059f

larryxiao force-pushed the 2062 branch from e9f8802 to 614059f Compare September 16, 2014 23:55

ShippableVertexPartition.initFrom: Don't run mergeFunc on default values

476770b

Merge pull request #1 from ankurdave/SPARK-2062

625aa9d

ShippableVertexPartition.initFrom: Don't run mergeFunc on default values

asfgit closed this in 3bbbdd8 Sep 19, 2014
