Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18147][SQL] do not fail for very complex aggregator result type #15807

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Nov 8, 2016

What changes were proposed in this pull request?

In TypedAggregateExpression.evaluateExpression, we may create ReferenceToExpressions with CreateStruct, and CreateStruct may generate too many codes and split them into several methods. ReferenceToExpressions will replace BoundReference in CreateStruct with LambdaVariable, which can only be used as local variables and doesn't work if we split the generated code.

It's already fixed by #15693 , this pr adds regression test

How was this patch tested?

new test in DatasetAggregatorSuite

@cloud-fan
Copy link
Contributor Author

Sorry I haven't noticed that #15693 was merged. Then this PR becomes a cleanup, not a bug fix. But I'd like to keep the regression test as it's from another JIRA ticket.

cc @hvanhovell @viirya @kiszk @davies

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68331 has finished for PR 15807 at commit 3917630.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class ComplexAggData(d1: AggData, d2: AggData)

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68334 has finished for PR 15807 at commit 0103bb4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Nov 8, 2016

There is probably a bug of common subexpression elimination: we will evalute all subexpressions at the very beginning, no matter the results of subexpressions will be used or not. A counter example:

if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2
  ....
  SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
}

AssertNotNull(subexpr) // subexpr2 should only be evaluated when subexpr is not null, however, with subexpression elimination, we will always evaluate it.

This may also be bad for performance, if a subexpression is expensive and won't be evaluted in most cases, but with subexpression elimination, we will always evaluate it.
cc @davies

@viirya
Copy link
Member

viirya commented Nov 8, 2016

Not sure if this example is a real one or not. But looks like subexpr is the subexpression, not assertNotNull(subexpr). Why assertNotNull(subexpr) will be always evaluated?

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Nov 8, 2016

@viirya sorry I forgot to add one line code in the example. When AsertNotNull(subexpr) is also a subexpression, we will always evaluate it.

@viirya
Copy link
Member

viirya commented Nov 8, 2016

@cloud-fan Yeah, looks like it is possibly the case. My first thought is seems not hard to solve this. I will look at this tomorrow.

@kiszk
Copy link
Member

kiszk commented Nov 8, 2016

In general, in addition to bad performance, it may lead to an incorrect result if subexpr2 is always evaluated. When subexpr is null, to evaluate subexpr2 may throw an exception.
A compiler has to carefully move statements beyond a conditional branch.

@viirya
Copy link
Member

viirya commented Nov 9, 2016

subexpr2 is a special case as it is an AssertNotNull which is a NonSQLExpression and can throw an exception at runtime. If I understand correctly, a SQL expression should not throw an exception like that. Running subexpressions for pure SQL expressions should be a performance issue only, I guess.

I would a bit prefer to disallow subexpression elimination for a NonSQLExpression. Looks like it is easy to cause not excepted result and exception.

@viirya
Copy link
Member

viirya commented Nov 9, 2016

If we only evaluate the subexpressions before they are used. Wouldn't it cause more than once evaluation?

E.g.,

if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2
  ....
  SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
}

if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2
  ....
  SomeExpr2(AssertNotNull(subexpr)) // SomeExpr2(subexpr2)
}   

SomeExpr3(AssertNotNull(subexpr)) // SomeExpr3(subexpr2)

If we evaluate subexpr2 before it is used, we might evaluate it three times at the end.

@viirya
Copy link
Member

viirya commented Nov 9, 2016

@cloud-fan @kiszk I would propose to skip subexpression elimination for the expressions wrapped in condition expressions such as If. What do you think?

@kiszk
Copy link
Member

kiszk commented Nov 9, 2016

@viirya @cloud-fan It looks reasonable to me that to skip subexpression elimination for the expressions wrapped in condition expressions such as if. This is because we have only a place at top level to put common subexpression.

@cloud-fan
Copy link
Contributor Author

can we just evaluate subexpression like a scala lazy val?

@viirya
Copy link
Member

viirya commented Nov 9, 2016

@cloud-fan Then once the first expression to use the subexpression is in a if/else branch, we can't access the subexpression outside later. Evaluate it again?

@cloud-fan
Copy link
Contributor Author

we can't access the subexpression outside later

I don't quite understand it, can you give an example?

@viirya
Copy link
Member

viirya commented Nov 9, 2016

E.g.,

if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2, first used.
  ....
  SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
}

if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2
  ....
  SomeExpr2(AssertNotNull(subexpr)) // SomeExpr2(subexpr2)
}   

SomeExpr3(AssertNotNull(subexpr)) // SomeExpr3(subexpr2)

AssertNotNull(subexpr) is recognized as subexpression. When it is used in the else branch at the top, it is evaluated in this branch. But SomeExpr3(AssertNotNull(subexpr)) which also uses it can't access the evaluated subexpression result.

@cloud-fan
Copy link
Contributor Author

isn't the result of subexpression kept in member variables? What I am talking about is something like:

private boolean isInitialized = false;
private Int subExpr1 = 0;

private void evalSubExpr1() {
  ...
  isInitialized = true
}

public int getSubExpr() {
  if (!isInitialized) {
    evalSubExpr1()
  }
  subExpr1
}

@viirya
Copy link
Member

viirya commented Nov 9, 2016

isn't the result of subexpression kept in member variables?

For non-wholestage codegen, yes. For wholestage codegen, no.

@cloud-fan
Copy link
Contributor Author

why whole stage codegen can't use member variables to keep the result of subexpression?

@viirya
Copy link
Member

viirya commented Nov 9, 2016

even we modify it to hold the results of subexpressions in member variables, the above code example should not work under wholestage codegen.

The above code example is similar to non wholestage codegen subexpression elimination in fact, which is also using functions to wrap subexpression evaluations. Those functions take input row as parameter and evaluate subexpressions against the input row.

But for wholestage codegen, as we might evaluate expressions against input row or local variables, the function approach can't work due to these local variables.

@kiszk
Copy link
Member

kiszk commented Nov 9, 2016

For wholestage codegen, I think that a life time of sub-expressions is within an iteration for a row. Thus, isInitialized and subExpr1 should be initialized at the beginning of each iteration. For example, after reading a row.

@cloud-fan
Copy link
Contributor Author

+1 on @kiszk 's idea, the next problem is, the sub expr eval method may need local variables instead of input row i, but we can inline it at every place which need the result of subexpression, e.g.

if () {
  ...
  if (!isInitialized) {
    ... // code to eval subExpr1
    isInitialized = true
  }
  ...
} else {
  if (!isInitialized) {
    ... // code to eval subExpr1
    isInitialized = true
  }
}

but doing this may lead to a lot of duplicated java codes, I think a better approach is to detect the local variables needed in the sub expr eval method and add them to parameters.

@viirya
Copy link
Member

viirya commented Nov 9, 2016

@cloud-fan Looks good for now. I will take a look and give it a try tomorrow.

@cloud-fan
Copy link
Contributor Author

As the subexpression elimination problem may be hard to fix, I only adds regression test in this PR, we can remove the ReferenceToExpressions later.

@viirya
Copy link
Member

viirya commented Nov 10, 2016

@cloud-fan OK. I am looking at that issue.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68436 has finished for PR 15807 at commit 3315d10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ComplexAggData(d1: AggData, d2: AggData)

asfgit pushed a commit that referenced this pull request Nov 10, 2016
## What changes were proposed in this pull request?

~In `TypedAggregateExpression.evaluateExpression`, we may create `ReferenceToExpressions` with `CreateStruct`, and `CreateStruct` may generate too many codes and split them into several methods.  `ReferenceToExpressions` will replace `BoundReference` in `CreateStruct` with `LambdaVariable`, which can only be used as local variables and doesn't work if we split the generated code.~

It's already fixed by #15693 , this pr adds regression test

## How was this patch tested?

new test in `DatasetAggregatorSuite`

Author: Wenchen Fan <[email protected]>

Closes #15807 from cloud-fan/typed-agg.

(cherry picked from commit 6021c95)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master/2.1

@asfgit asfgit closed this in 6021c95 Nov 10, 2016
asfgit pushed a commit that referenced this pull request Jan 23, 2017
…tional expressions

## What changes were proposed in this pull request?

As I pointed out in #15807 (comment) , the current subexpression elimination framework has a problem, it always evaluates all common subexpressions at the beginning, even they are inside conditional expressions and may not be accessed.

Ideally we should implement it like scala lazy val, so we only evaluate it when it gets accessed at lease once. #15837 tries this approach, but it seems too complicated and may introduce performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.

## How was this patch tested?

regression test

Author: Wenchen Fan <[email protected]>

Closes #16659 from cloud-fan/codegen.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

~In `TypedAggregateExpression.evaluateExpression`, we may create `ReferenceToExpressions` with `CreateStruct`, and `CreateStruct` may generate too many codes and split them into several methods.  `ReferenceToExpressions` will replace `BoundReference` in `CreateStruct` with `LambdaVariable`, which can only be used as local variables and doesn't work if we split the generated code.~

It's already fixed by apache#15693 , this pr adds regression test

## How was this patch tested?

new test in `DatasetAggregatorSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#15807 from cloud-fan/typed-agg.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…tional expressions

## What changes were proposed in this pull request?

As I pointed out in apache#15807 (comment) , the current subexpression elimination framework has a problem, it always evaluates all common subexpressions at the beginning, even they are inside conditional expressions and may not be accessed.

Ideally we should implement it like scala lazy val, so we only evaluate it when it gets accessed at lease once. apache#15837 tries this approach, but it seems too complicated and may introduce performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.

## How was this patch tested?

regression test

Author: Wenchen Fan <[email protected]>

Closes apache#16659 from cloud-fan/codegen.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…tional expressions

## What changes were proposed in this pull request?

As I pointed out in apache#15807 (comment) , the current subexpression elimination framework has a problem, it always evaluates all common subexpressions at the beginning, even they are inside conditional expressions and may not be accessed.

Ideally we should implement it like scala lazy val, so we only evaluate it when it gets accessed at lease once. apache#15837 tries this approach, but it seems too complicated and may introduce performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.

## How was this patch tested?

regression test

Author: Wenchen Fan <[email protected]>

Closes apache#16659 from cloud-fan/codegen.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants