[SPARK-19426][SQL] Custom coalesce for Dataset #16766
Conversation
```scala
 * Returns a new RDD that has exactly `numPartitions` partitions.
 */
case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
    child: LogicalPlan)
```
Could you follow the styles documented in https://github.com/databricks/scala-style-guide?
```
@@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
 }

/**
 * Returns a new RDD that has exactly `numPartitions` partitions.
 */
case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
```
`CoalesceLogical` -> `Coalesce`?
The main reason is that there is already a `Coalesce` expression class.
Could you please also add a few test cases? For example,
I'd second that. I'd be interested to know if this implementation changes behavior for
```scala
    CoalesceLogical(numPartitions, partitionCoalescer, logicalPlan)
  }

  def coalesce(numPartitions: Int): Dataset[T] = coalesce(numPartitions, None)
```
Please also add a function description, like we did for the other functions in Dataset.scala?
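For illustration, a documented overload might look roughly like this. This is a sketch only: `withTypedPlan` and the `PartitionCoalesce` node name are assumptions drawn from the surrounding discussion, not the merged code.

```scala
/**
 * Returns a new Dataset that has exactly `numPartitions` partitions, without
 * triggering a shuffle.
 *
 * A `PartitionCoalescer` can optionally be supplied to customize how the
 * existing partitions are grouped together, similar to `RDD.coalesce`.
 */
def coalesce(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer]): Dataset[T] =
  withTypedPlan {
    PartitionCoalesce(numPartitions, partitionCoalescer, logicalPlan)
  }
```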
```scala
// before:
case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
// after:
case class CoalesceExec(numPartitions: Int, child: SparkPlan,
    partitionCoalescer: Option[PartitionCoalescer]
) extends UnaryExecNode {
```
The same indent issue here.
```scala
case class CoalesceExec(
    numPartitions: Int,
    child: SparkPlan,
    partitionCoalescer: Option[PartitionCoalescer]) extends UnaryExecNode {
```
Do you guys have a .scalafmt.conf that applies all of this? That would make things cleaner.
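For what it's worth, a minimal `.scalafmt.conf` approximating the two rules raised in this review (100-column lines, 4-space indentation at definition sites) might look like the fragment below. The exact keys depend on the scalafmt version, so treat it as a sketch rather than Spark's actual config:

```
maxColumn = 100
continuationIndent.defnSite = 4
continuationIndent.callSite = 2
```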
```scala
import org.apache.spark.sql.catalyst.{CatalystConf, TableIdentifier}
import scala.collection.mutable.ArrayBuffer
```
Is this import unused?
```
@@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
 }

/**
 * Returns a new RDD that has exactly `numPartitions` partitions.
 */
case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
```
The name still looks inconsistent with the others. How about `PartitionCoalesce`?
That sounds good.
```
@@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
 }

/**
 * Returns a new RDD that has exactly `numPartitions` partitions.
```
This description is not right.
```scala
 */
case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
    child: LogicalPlan)
  extends UnaryNode {
```
Style issues:
```scala
case class PartitionCoalesce(
    numPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer],
    child: LogicalPlan) extends UnaryNode {
```
```
@@ -19,9 +19,8 @@ package org.apache.spark.sql.execution

import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration.Duration
```
Add it back?
```scala
val data = (1 to 1000).map(i => ClassData(i.toString, i))
data.toDS().repartition(10).write.format("csv").save(path.toString)

val ds = spark.read.format("csv").load(path.toString).as[ClassData]
```
```
cannot resolve '`a`' given input columns: [_c0, _c1];
```
Oh right, CSV doesn't write headers by default.
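One way around the `_c0`/`_c1` column names (illustrative only; this may not be how the test was ultimately fixed) is to write and read the CSV with the `header` option, plus `inferSchema` so the numeric field is not read back as a string:

```scala
// Write with a header row so column names survive the round trip.
data.toDS().repartition(10).write
  .option("header", "true")
  .format("csv")
  .save(path.toString)

// Read the header back and infer column types before converting to ClassData.
val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("csv")
  .load(path.toString)
  .as[ClassData]
```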
```scala
// Similar to the implementation of `test("custom RDD coalescer")` from [[RDDSuite]] we first
// write out to disk, to ensure that our splits are in fact [[FileSplit]] instances.
val data = (1 to 1000).map(i => ClassData(i.toString, i))
data.toDS().repartition(10).write.format("csv").save(path.toString)
```
Use `withTempPath` to generate the path?
```scala
after {
  Utils.deleteRecursively(path)
}
```
No need to do this if you use `withTempPath`. This is an example.
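A sketch of the suggested pattern: `withTempPath` comes from Spark's `SQLTestUtils` and deletes the directory when the body finishes, so the explicit `after { ... }` cleanup block becomes unnecessary. The assertions are placeholders for whatever the coalesce test checks:

```scala
withTempPath { path =>
  val data = (1 to 1000).map(i => ClassData(i.toString, i))
  data.toDS().repartition(10).write.format("csv").save(path.toString)

  val ds = spark.read.format("csv").load(path.toString)
  // ... run the coalesce under test and assert on ds ...
}
// No manual Utils.deleteRecursively(path) needed: withTempPath cleans up.
```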
Ah thanks, I had looked at the writer tests.
```scala
  assert(splitSizeSum <= maxSplitSize)
})
assert(totalPartitionCount == 10)
```
Nit: Remove this empty line.
@felixcheung This does not touch any of the coalesce internals. It only allows setting a partitionCoalescer, similar to what is already available in rdd.coalesce.
Sorry for the late reply. @mariusvniekerk Could you please update the PR?
ok to test
Test build #77969 has finished for PR 16766 at commit
Let me rebase this. I don't currently have a clean way of testing this on Windows.
Could you run the following four commands to do a local test in your environment?
Force-pushed from d4bde0b to 00b2a7a.
Test build #78212 has finished for PR 16766 at commit
Test build #78213 has finished for PR 16766 at commit
Test build #78218 has finished for PR 16766 at commit
Actual javadoc errors are as below:
```
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/Dataset.java:2222: error: reference not found
[error]  * A {@link PartitionCoalescer} can also be supplied allowing the behavior of the partitioning to be
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/Dataset.java:2223: error: reference not found
[error]  * customized similar to {@link RDD.coalesce}.
[error]
```
```
@@ -2603,12 +2603,27 @@ class Dataset[T] private[sql](
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * A [[PartitionCoalescer]] can also be supplied allowing the behavior of the partitioning to be
 * customized similar to [[RDD.coalesce]].
```
I think it should be `[[org.apache.spark.rdd.RDD#coalesce]]`.
```
@@ -2603,12 +2603,27 @@ class Dataset[T] private[sql](
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * A [[PartitionCoalescer]] can also be supplied allowing the behavior of the partitioning to be
```
Sounds like this trait can't be resolved when the Java docs are generated. Simply wrapping it in backticks would be fine.
Hi @mariusvniekerk, would you be able to fix the javadoc errors?
cc @maropu Do you want to take this over?
@gatorsmile Sure, I'll do. Thanks!
What changes were proposed in this pull request?
This adds support for using the `PartitionCoalescer` features added in #11865 (SPARK-14042) to the Dataset API.
How was this patch tested?
Manual tests.
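For context, a custom coalescer plugged into this API would implement the `PartitionCoalescer` trait from `org.apache.spark.rdd` (added in SPARK-14042). The sketch below is illustrative only: `RoundRobinCoalescer` is a hypothetical example, and the two-argument `Dataset.coalesce` call is the overload this PR proposes, not existing API:

```scala
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// A toy coalescer that assigns parent partitions round-robin to at most
// `maxPartitions` groups. Realistic coalescers (e.g. for FileSplit locality)
// would instead group by preferred location or split size.
class RoundRobinCoalescer extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    val numGroups = math.min(maxPartitions, parent.partitions.length)
    val groups = Array.fill(numGroups)(new PartitionGroup())
    parent.partitions.zipWithIndex.foreach { case (part, i) =>
      groups(i % numGroups).partitions += part
    }
    groups
  }
}

// Hypothetical call site using the overload proposed in this PR:
// ds.coalesce(4, Some(new RoundRobinCoalescer))
```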