[SPARK-21657][SQL] Optimize explode quadratic memory consumption #19683
Conversation
ok to test
Test build #83558 has finished for PR 19683 at commit
I've fixed the styling issue.
Test build #83559 has finished for PR 19683 at commit
I've fixed more styling issues.
Test build #83562 has finished for PR 19683 at commit
Test build #83566 has finished for PR 19683 at commit
Test build #83587 has finished for PR 19683 at commit
Do you understand this failure?
retest this please
Test build #83613 has finished for PR 19683 at commit
Can somebody please review this PR? Thanks.
} else {
  generatorOutput
  projectedChildOutput ++ generatorOutput
} else {
nit: do we need to update the indentation?
fixed.
Test build #84001 has finished for PR 19683 at commit
Hi, has anybody had a chance to look at this PR?
Looks like @kiszk already reviewed.
A few small things but this isn't really my area, so will delegate to others.
@@ -450,6 +450,11 @@ object ColumnPruning extends Rule[LogicalPlan] {
    case p @ Project(_, g: Generate) if g.join && p.references.subsetOf(g.generatedSet) =>
      p.copy(child = g.copy(join = false))

    // Turn on `omitGeneratorChild` for Generate if it's child column is not used
    case p @ Project(_, g @ Generate(gu: UnaryExpression, true, _, false, _, _, _))
        if (AttributeSet(Seq(gu.child)) -- p.references).nonEmpty =>
p.references.contains(gu.child)
?
doesn't compile:
Type mismatch, expected: NamedExpression, actual: Expression
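For context on why the set-difference form works where `contains` on the raw child does not: `AttributeSet(...)` collects the attributes *referenced by* an arbitrary expression, so the guard asks whether any attribute the generator's child depends on goes unused by the Project. A toy sketch with plain Scala sets (hypothetical simplified classes, not the real Catalyst API):

```scala
// Toy model (hypothetical, not Catalyst): expressions expose the attribute
// names they reference, mirroring what AttributeSet(Seq(expr)) collects.
sealed trait Expr { def references: Set[String] }
case class Attr(name: String) extends Expr { def references: Set[String] = Set(name) }
case class Upper(child: Expr) extends Expr { def references: Set[String] = child.references }

// The rule's guard, rephrased: is any attribute referenced by the generator's
// child absent from the Project's references?
def childUnused(genChild: Expr, projectRefs: Set[String]): Boolean =
  (genChild.references -- projectRefs).nonEmpty

childUnused(Attr("arr"), Set("exploded"))          // true: "arr" is projected out
childUnused(Upper(Attr("arr")), Set("arr", "col")) // false: "arr" is still needed
```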
@@ -450,6 +450,11 @@ object ColumnPruning extends Rule[LogicalPlan] {
    case p @ Project(_, g: Generate) if g.join && p.references.subsetOf(g.generatedSet) =>
      p.copy(child = g.copy(join = false))

    // Turn on `omitGeneratorChild` for Generate if it's child column is not used
nit: its
    generatorOutput: Seq[Attribute],
    child: SparkPlan)
  extends UnaryExecNode with CodegenSupport {

  private def projectedChildOutput = generator match {
    case g: UnaryExpression if omitGeneratorChild =>
      child.output diff Seq(g.child)
nit: .diff(...)
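For reference, `Seq#diff` here is standard Scala collection subtraction: it removes the first occurrence of each element of its argument, which is how the generator's child column gets dropped from the child output. A minimal illustration (illustrative names only):

```scala
// Seq#diff removes the first occurrence of each element of its argument.
val childOutput = Seq("a", "b", "arr")
val projected = childOutput.diff(Seq("arr"))
// projected == Seq("a", "b")
```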
I've been looking at this a bit for my own edification. I wonder if a slightly more general approach would be to omit the UnsafeProject step in GenerateExec.doExecute if the generation is an immediate child of a projection?
That would save the copy entirely (rather than just the projected-out child column) since the projection is just going to redo the copy anyhow.
I don't know if SparkSQL has a more elegant way of expressing the constraint that output rows from a SparkPlan must be mutable, but it would be quite nice to be able to emit the JoinedRows from the generator and delay materialization of the combined tuples until needed.
Test build #84595 has finished for PR 19683 at commit
could you please retest this?
@henryr I understand what you're saying. I'm not sure why the UnsafeProject is at the end of the function, but it was added by the PR that fixes [SPARK-13476], without much elaboration.
I can reproduce the test failure locally.
The SQL in the failed test is SELECT udtf_count2(a) FROM (SELECT 1 AS a FROM src LIMIT 3) t.
Now I feel it's a little hacky to introduce
A similar exception to the one in the failing unit tests was fixed in SPARK-18300.
Not sure if this is directly applicable or helpful here, though.
seems reasonable, let's do that.
    outer: Boolean,
    qualifier: Option[String],
    generatorOutput: Seq[Attribute],
    child: LogicalPlan)
wrong indentation?
@@ -57,20 +62,19 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
 */
case class GenerateExec(
    generator: Generator,
    join: Boolean,
    unrequiredChildIndex: Seq[Int],
The physical plan can just take requiredChildOutput, and in the planner we can just do
case g @ logical.Generate(...) => GenerateExec(..., g.requiredChildOutput)
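The suggestion can be sketched with toy stand-ins for the logical and physical nodes (hypothetical simplified classes and a name-based attribute model, not the real Spark planner): the logical node derives requiredChildOutput from unrequiredChildIndex once, and the physical node just receives the result.

```scala
// Toy stand-ins for the logical and physical Generate nodes (not real Spark classes).
case class LogicalGenerate(childOutput: Seq[String], unrequiredChildIndex: Seq[Int]) {
  // Child attributes that downstream operators still need.
  def requiredChildOutput: Seq[String] = {
    val unrequired = unrequiredChildIndex.toSet
    childOutput.zipWithIndex.collect { case (a, i) if !unrequired(i) => a }
  }
}
case class GenerateExec(requiredChildOutput: Seq[String])

// In the planner, the physical node just takes the precomputed attributes.
def plan(g: LogicalGenerate): GenerateExec = GenerateExec(g.requiredChildOutput)

plan(LogicalGenerate(Seq("a", "arr", "b"), Seq(1)))
// the array column at index 1 is dropped from the physical node's child output
```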
LGTM except 2 comments
@@ -47,8 +47,13 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
 * terminate().
 *
 * @param generator the generator expression
 * @param join when true, each output row is implicitly joined with the input tuple that produced
 *             it.
 * @param requiredChildOutput this paramter starts as Nil and gets filled by the Optimizer.
we don't need to duplicate the comment here, just say required attributes from child output
val rows = if (requiredChildOutput.nonEmpty) {

  val pruneChildForResult: InternalRow => InternalRow =
    if ((child.outputSet -- requiredChildOutput).isEmpty) {
just child.output == requiredChildOutput
?
wouldn't it always return false? or should I use child.output == AttributeSet(requiredChildOutput)
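The subtlety behind that question is that `Seq` equality is element-by-element and order-sensitive, while the `outputSet` comparison is a set operation that ignores order. A toy illustration with plain Scala collections (names are illustrative only):

```scala
val childOutput = Seq("a", "b", "c")
val required = Seq("c", "a", "b")

// Seq equality compares element-by-element, so order matters:
val seqEqual = childOutput == required                        // false
// Set difference ignores order, like the outputSet comparison above:
val nothingPruned = (childOutput.toSet -- required.toSet).isEmpty // true
```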
Test build #85501 has finished for PR 19683 at commit
Test build #85500 has finished for PR 19683 at commit
Test build #85502 has finished for PR 19683 at commit
retest this please.
 * A common use case is when we explode(array(..)) and are interested
 * only in the exploded data and not in the original array. before this
 * optimization the array got duplicated for each of its elements,
 * causing O(n^^2) memory consumption. (see [SPARK-21657])
nit: there seems to be an extra space at the beginning of each line from the 2nd line on.
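To make the O(n^2) point concrete, a toy model in plain Scala (schematic only: Spark physically copies the array into each UnsafeRow, whereas this JVM sketch merely shares a reference, so read the cell counts as what Spark would materialize):

```scala
// One input row holding an n-element array, exploded into n output rows.
val n = 1000
val arr = Vector.fill(n)(42)

// Before the optimization: every output row carries the whole source array,
// so materializing the result costs n rows x n cells = O(n^2).
val joined = arr.map(elem => (arr, elem))

// With the child column pruned: each output row carries only its element, O(n).
val pruned = arr.map(elem => Tuple1(elem))
```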
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.benchmark

import org.apache.spark.util.Benchmark

nit: unnecessary blank line.
}

/*
This benchmark result should be moved up to be included in the ignored test block, as other tests.
Three comments for style. LGTM.
Test build #85503 has finished for PR 19683 at commit
this timeout in "org.apache.spark.ml.regression.LinearRegressionSuite.linear regression with intercept without regularization" doesn't seem related to our fix.
retest this please
Test build #85504 has finished for PR 19683 at commit
thanks, merging to master, great work!
What changes were proposed in this pull request?
The issue has been raised in two Jira tickets: SPARK-21657 and SPARK-16998. Basically, collection generators like explode/inline create many rows from each input row, and currently each exploded row also contains the column on which it was created. As a result, if one row holds a 10k-element array, that array gets copied 10k times, once into each output row, which results in quadratic memory consumption. However, it is a common case that the original column gets projected out after the explode, so we can avoid duplicating it.
In this solution we propose to identify this situation in the optimizer and turn on a flag for omitting the original column in the generation process.
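The proposed rewrite can be sketched with toy stand-ins for the plan nodes (hypothetical and heavily simplified relative to Catalyst, using the flag formulation described above and a name-based reference model):

```scala
// Toy plan nodes (not real Catalyst classes).
case class Generate(childColumn: String, omitGeneratorChild: Boolean = false)
case class Project(references: Set[String], child: Generate)

// The optimizer rule: if the Project above the Generate never reads the
// generator's source column, flip the flag so Generate stops emitting it.
def pruneGeneratorChild(p: Project): Project =
  if (!p.references.contains(p.child.childColumn))
    p.copy(child = p.child.copy(omitGeneratorChild = true))
  else p

pruneGeneratorChild(Project(Set("exploded"), Generate("arr")))
// the flag is turned on, since "arr" is projected out above the Generate
```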
How was this patch tested?