[SPARK-21657][SQL] Optimize explode quadratic memory consumption #19683

Closed · 31 commits

Conversation

uzadude
Contributor

@uzadude uzadude commented Nov 7, 2017

What changes were proposed in this pull request?

The issue has been raised in two Jira tickets: SPARK-21657 and SPARK-16998. Collection generators like explode/inline create many output rows from each input row, and currently every exploded row also carries the column it was generated from. If one row holds a 10k-element array, for example, that array is copied 10k times, once into each generated row, which results in quadratic memory consumption. However, it is common for the original column to be projected out right after the explode, so we can avoid duplicating it.
In this solution we propose to identify that situation in the optimizer and turn on a flag that omits the original column during generation.
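
To make the shape concrete, here is a minimal, self-contained sketch of the kind of query this targets; the column names, array size, and local SparkSession setup are illustrative, not taken from the patch:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Illustrative only: a single row holding a 10k-element array.
    val spark = SparkSession.builder().master("local[*]").appName("explode-shape").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq.range(0, 10000))).toDF("id", "arr")

    // Only the exploded element is referenced downstream; `arr` itself is never
    // read again, so Generate does not need to copy it into every output row.
    df.select(explode($"arr").as("elem"))
      .agg(count("elem"))
      .show()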

How was this patch tested?

  1. We added a benchmark test to MiscBenchmark that shows a 16x improvement in runtime (a rough way to observe the effect locally is sketched after this list).
  2. We ran some of the other tests in MiscBenchmark, and they show roughly 15% improvements.
  3. We ran this code on a specific case from our production data, with rows containing arrays of ~200k elements, and it reduced the runtime from 6 hours to 3 minutes.
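
Not the MiscBenchmark code itself, just a rough, self-contained way to observe the effect locally; the row count and array size here are made up:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("explode-timing").getOrCreate()
    import spark.implicits._

    // 100 rows, each carrying a 10k-element array: without the optimization the
    // array is copied into every one of the ~1M generated rows.
    val bigArrays = Seq.tabulate(100)(id => (id, Seq.range(0, 10000))).toDF("id", "arr")

    spark.time {
      bigArrays.select(explode($"arr").as("elem")).agg(sum("elem")).collect()
    }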

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Nov 7, 2017

Test build #83558 has finished for PR 19683 at commit ce7c369.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uzadude
Contributor Author

uzadude commented Nov 7, 2017

I've fixed the styling issue.

@SparkQA

SparkQA commented Nov 7, 2017

Test build #83559 has finished for PR 19683 at commit 76aa258.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uzadude
Contributor Author

uzadude commented Nov 7, 2017

I've fixed more styling issues.

@SparkQA

SparkQA commented Nov 7, 2017

Test build #83562 has finished for PR 19683 at commit 7a9bc96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 7, 2017

Test build #83566 has finished for PR 19683 at commit a3050e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 8, 2017

Test build #83587 has finished for PR 19683 at commit b8b5960.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uzadude
Contributor Author

uzadude commented Nov 8, 2017

Do you understand this failure?

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 9, 2017

Test build #83613 has finished for PR 19683 at commit b8b5960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Tagar

Tagar commented Nov 10, 2017

Can somebody please review this PR? Thanks.

  projectedChildOutput ++ generatorOutput
} else {
  generatorOutput
}
Member

nit: do we need update indentation?

Contributor Author

fixed.

@SparkQA

SparkQA commented Nov 19, 2017

Test build #84001 has finished for PR 19683 at commit b825d6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uzadude
Contributor Author

uzadude commented Nov 28, 2017

Hi, has somebody had a chance to look at this PR?
I think it's a pretty useful optimization.

@Tagar

Tagar commented Nov 29, 2017

Looks like @kiszk already reviewed.
@gatorsmile, @cloud-fan would this be enough to commit this into Spark-2.3?
If not, would one of you please glance at it too?
It's not a huge patch, but its performance improvements are tremendous.

Contributor

@vanzin vanzin left a comment

A few small things, but this isn't really my area, so I'll delegate to others.

@@ -450,6 +450,11 @@ object ColumnPruning extends Rule[LogicalPlan] {
case p @ Project(_, g: Generate) if g.join && p.references.subsetOf(g.generatedSet) =>
p.copy(child = g.copy(join = false))

// Turn on `omitGeneratorChild` for Generate if it's child column is not used
case p @ Project(_, g @ Generate(gu: UnaryExpression, true, _, false, _, _, _))
if (AttributeSet(Seq(gu.child)) -- p.references).nonEmpty =>
Contributor

p.references.contains(gu.child)?

Contributor Author

doesn't compile:
Type mismatch, expected: NamedExpression, actual: Expression

@@ -450,6 +450,11 @@ object ColumnPruning extends Rule[LogicalPlan] {
case p @ Project(_, g: Generate) if g.join && p.references.subsetOf(g.generatedSet) =>
p.copy(child = g.copy(join = false))

// Turn on `omitGeneratorChild` for Generate if it's child column is not used
Contributor

nit: its

generatorOutput: Seq[Attribute],
child: SparkPlan)
extends UnaryExecNode with CodegenSupport {

private def projectedChildOutput = generator match {
case g: UnaryExpression if omitGeneratorChild =>
child.output diff Seq(g.child)
Contributor

nit: .diff(...)

@Tagar

Tagar commented Nov 30, 2017

Thanks @vanzin and @uzadude

Contributor

@henryr henryr left a comment

I've been looking at this a bit for my own edification. I wonder if a slightly more general approach would be to omit the UnsafeProject step in GenerateExec.doExecute if the generation is an immediate child of a projection?

That would save the copy entirely (rather than just the projected-out child column) since the projection is just going to redo the copy anyhow.

I don't know if SparkSQL has a more elegant way of expressing the constraint that output rows from a SparkPlan must be mutable, but it would be quite nice to be able to emit the JoinedRows from the generator and delay materialization of the combined tuples until needed.
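
As a tiny aside for readers following the thread, this is the kind of wrapping JoinedRow does; it is an existing catalyst class, and this snippet is only an illustration, not code from the PR:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.JoinedRow

    // A JoinedRow wraps the two backing rows without copying either side; the
    // copy only happens when something like an UnsafeProjection materializes
    // the combined tuple later.
    val childRow     = InternalRow(1, 2L)   // the input row fed to the generator
    val generatedRow = InternalRow(42)      // one row produced by the generator
    val combined     = new JoinedRow(childRow, generatedRow)

    println(combined.numFields)   // 3
    println(combined.getInt(0))   // 1, read through from childRow
    println(combined.getInt(2))   // 42, read through from generatedRow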

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84595 has finished for PR 19683 at commit 04b5814.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uzadude
Contributor Author

uzadude commented Dec 7, 2017

could you please retest this?

@uzadude
Contributor Author

uzadude commented Dec 7, 2017

@henryr I understand what you're saying. I'm not sure why the UnsafeProject is there at the end of the function, but it's only commented on in the PR that fixes [SPARK-13476], without much elaboration.

@viirya
Member

viirya commented Dec 29, 2017

I can reproduce the test failure locally.

@viirya
Member

viirya commented Dec 29, 2017

The SQL in the failing test org.apache.spark.sql.hive.execution.HiveUDFSuite.UDTF looks like:

SELECT udtf_count2(a) FROM (SELECT 1 AS a FROM src LIMIT 3) t

The unrequiredChildOutput is ArrayBuffer(a#1287) at first. But since that column is in fact a literal, the unrequiredChildOutput is later optimized to ArrayBuffer(1).

@cloud-fan
Contributor

Now I feel it's a little hacky to introduce Generate.unrequiredChildOutput, as the attribute may get replaced by something else during optimization. How about Generate.unrequiredChildIndex?
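
A rough, plain-Scala sketch of the index-based bookkeeping being suggested here; the names are stand-ins for illustration, not the actual patch:

    object UnrequiredChildIndexSketch extends App {
      // Hypothetical stand-ins for Generate.child.output and the attributes the
      // parent Project still needs; in the real plan these are Attribute objects.
      val childOutput   = Seq("id", "arr", "extra")
      val requiredAttrs = Set("id")

      // Store positions rather than the attributes themselves, so a later optimizer
      // rewrite of an expression (e.g. into a literal) cannot invalidate the bookkeeping.
      val unrequiredChildIndex: Seq[Int] = childOutput.zipWithIndex.collect {
        case (attr, i) if !requiredAttrs.contains(attr) => i
      }

      println(unrequiredChildIndex) // List(1, 2)
    }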

@Tagar

Tagar commented Dec 29, 2017

A similar exception to the one in the failing unit tests was fixed in SPARK-18300:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.Attribute

#15892

Not sure if this is directly applicable or helpful here though.

@uzadude
Contributor Author

uzadude commented Dec 29, 2017

seems reasonable, let's do that.

outer: Boolean,
qualifier: Option[String],
generatorOutput: Seq[Attribute],
child: LogicalPlan)
Contributor

wrong indentation?

@@ -57,20 +62,19 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
*/
case class GenerateExec(
generator: Generator,
join: Boolean,
unrequiredChildIndex: Seq[Int],
Contributor

The physical plan can just take requiredChildOutput, and in the planner we can just do

case g @ logical.Generate(...) => GenerateExec(..., g.requiredChildOutput)

@cloud-fan
Contributor

LGTM except 2 comments

@@ -47,8 +47,13 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
* terminate().
*
* @param generator the generator expression
* @param join when true, each output row is implicitly joined with the input tuple that produced
* it.
* @param requiredChildOutput this paramter starts as Nil and gets filled by the Optimizer.
Contributor

we don't need to duplicate the comment here, just say required attributes from child output

val rows = if (requiredChildOutput.nonEmpty) {

val pruneChildForResult: InternalRow => InternalRow =
if ((child.outputSet -- requiredChildOutput).isEmpty) {
Contributor

just child.output == requiredChildOutput?

Contributor Author

wouldn't it always return false? or should I use child.output == AttributeSet(requiredChildOutput)
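
For context, a plain-Scala sketch (stand-in types, not the catalyst code) of what this check is for: prune the child row only when some column is actually dropped, otherwise pass it through untouched:

    object PruneSketch extends App {
      // Build a function that keeps only the required columns of a row, or the
      // identity function when nothing needs to be dropped.
      def prunerFor(childOutput: Seq[String], required: Seq[String]): Seq[Any] => Seq[Any] = {
        val keep = childOutput.zipWithIndex.collect {
          case (col, i) if required.contains(col) => i
        }
        if (keep.size == childOutput.size) {
          identity
        } else {
          row => keep.map(row)   // Seq is a function Int => A, so this picks columns by index
        }
      }

      val prune = prunerFor(Seq("id", "arr", "extra"), required = Seq("id"))
      println(prune(Seq(7, Seq(1, 2, 3), "x"))) // List(7)
    }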

@SparkQA

SparkQA commented Dec 29, 2017

Test build #85501 has finished for PR 19683 at commit 8f06dda.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 29, 2017

Test build #85500 has finished for PR 19683 at commit 288aa73.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 29, 2017

Test build #85502 has finished for PR 19683 at commit 1c6626a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Dec 29, 2017

retest this please.

* A common use case is when we explode(array(..)) and are interested
* only in the exploded data and not in the original array. before this
* optimization the array got duplicated for each of its elements,
* causing O(n^^2) memory consumption. (see [SPARK-21657])
Member

@viirya viirya Dec 29, 2017

nit: seems an extra space at the beginning since 2nd line.

@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.benchmark

import org.apache.spark.util.Benchmark


Member

nit: unnecessary blank line.


}

/*
Member

@viirya viirya Dec 29, 2017

This benchmark result should be moved up to be included in the ignored test block, as other tests.

@viirya
Member

viirya commented Dec 29, 2017

Three comments for style. LGTM.

@SparkQA

SparkQA commented Dec 29, 2017

Test build #85503 has finished for PR 19683 at commit 1c6626a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uzadude
Contributor Author

uzadude commented Dec 29, 2017

The timeout in "org.apache.spark.ml.regression.LinearRegressionSuite.linear regression with intercept without regularization" doesn't seem related to our fix.

@uzadude
Contributor Author

uzadude commented Dec 29, 2017

retest this please

@SparkQA

SparkQA commented Dec 29, 2017

Test build #85504 has finished for PR 19683 at commit 4edd884.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master, great work!

@asfgit asfgit closed this in fcf66a3 Dec 29, 2017