[SPARK-18147][SQL] do not fail for very complex aggregator result type #15807
Sorry, I hadn't noticed that #15693 was merged. This PR then becomes a cleanup rather than a bug fix, but I'd like to keep the regression test since it comes from another JIRA ticket.
Test build #68331 has finished for PR 15807 at commit
Test build #68334 has finished for PR 15807 at commit
There is probably a bug in common subexpression elimination: we evaluate all subexpressions at the very beginning, whether or not their results will be used. A counterexample:
This may also be bad for performance: if a subexpression is expensive and is not needed in most cases, subexpression elimination will still evaluate it every time.
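The eager-evaluation concern can be shown with a small Scala sketch. This is not the counterexample from the thread and not Spark's generated code: `expensive`, `guarded`, and `eagerlyHoisted` are hypothetical names, and integer division merely stands in for any subexpression that is costly or unsafe outside its guard.

```scala
object EagerCseSketch {
  // Counts how many times the (hypothetical) expensive subexpression runs.
  var evalCount = 0

  // Stand-in for a costly or unsafe subexpression; fails when x is zero.
  def expensive(x: Int): Int = {
    evalCount += 1
    100 / x
  }

  // Without elimination: the subexpression is reached only when the guard passes.
  def guarded(x: Int): Int =
    if (x == 0) -1 else expensive(x) + expensive(x)

  // With eager elimination: the shared subexpression is hoisted above the guard,
  // so it runs for every input, including inputs the guard would have protected.
  def eagerlyHoisted(x: Int): Int = {
    val shared = expensive(x) // evaluated unconditionally
    if (x == 0) -1 else shared + shared
  }
}
```

Under these assumptions, `guarded(0)` returns `-1` without touching `expensive`, while `eagerlyHoisted(0)` throws an `ArithmeticException` because the hoisted division runs before the guard.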
Not sure if this example is a real one or not. But looks like |
@viirya sorry, I forgot to add one line of code in the example. When |
@cloud-fan Yeah, it looks like that is possibly the case. My first thought is that this does not seem hard to solve. I will look at it tomorrow.
In general, in addition to bad performance, it may lead to an incorrect result if |
If we only evaluate the subexpressions right before they are used, wouldn't that cause them to be evaluated more than once? E.g.,
If we evaluate |
@cloud-fan @kiszk I would propose to skip subexpression elimination for the expressions wrapped in conditional expressions such as |
@viirya @cloud-fan It looks reasonable to me to skip subexpression elimination for the expressions wrapped in conditional expressions such as |
Can we just evaluate the subexpression like a Scala lazy val?
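For readers unfamiliar with the reference, a minimal sketch of the `lazy val` semantics being suggested follows. It only illustrates the Scala language feature; how generated code would imitate it is not shown, and the names `expensive` and `eval` are hypothetical.

```scala
object LazyValSketch {
  // Counts how many times the (hypothetical) subexpression is actually computed.
  var evalCount = 0

  def expensive(x: Int): Int = {
    evalCount += 1
    x * x
  }

  def eval(x: Int, useIt: Boolean): Int = {
    // Not computed here; computed on first access and cached afterwards.
    lazy val shared = expensive(x)
    if (useIt) shared + shared else 0
  }
}
```

With this shape, `eval(3, useIt = false)` never calls `expensive`, and `eval(3, useIt = true)` calls it exactly once even though `shared` is read twice.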
@cloud-fan Then, if the first expression to use the subexpression is inside an if/else branch, we can't access the subexpression outside that branch later. Evaluate it again?
I don't quite understand it, can you give an example? |
E.g.,
Isn't the result of the subexpression kept in member variables? What I am talking about is something like:
For non-wholestage codegen, yes. For wholestage codegen, no. |
Why can't wholestage codegen use member variables to keep the result of the subexpression?
Even if we modify it to hold the results of subexpressions in member variables, the above code example would not work under wholestage codegen. It is in fact similar to subexpression elimination in non-wholestage codegen, which also uses functions to wrap subexpression evaluations. Those functions take the input row as a parameter and evaluate subexpressions against it. But in wholestage codegen we might evaluate expressions against either the input row or local variables, so the function approach can't work because of those local variables.
For wholestage codegen, I think that the lifetime of a subexpression is a single iteration for a row. Thus, |
+1 on @kiszk's idea. The next problem is that the subexpression eval method may need local variables instead of the input row.
But doing this may lead to a lot of duplicated Java code. I think a better approach is to detect the local variables needed by the subexpression eval method and add them as parameters.
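One reading of the approach discussed above can be sketched as follows: the result lives in member variables, is reset once per row, and the eval method receives the local variables it needs as parameters. This is a hand-written Scala illustration under those assumptions, not Spark's generated Java; `RowProcessorSketch`, `subExpr`, and `processRow` are hypothetical names.

```scala
final class RowProcessorSketch {
  // Member variables holding the cached subexpression result for the current row.
  private var subExprComputed = false
  private var subExprValue = 0L

  // Counts real evaluations, for illustration only.
  var evalCount = 0

  // The eval method takes the local variables it depends on as parameters
  // instead of reading them from an input row.
  private def subExpr(a: Long, b: Long): Long = {
    if (!subExprComputed) {
      evalCount += 1
      subExprValue = a * b
      subExprComputed = true
    }
    subExprValue
  }

  def processRow(a: Long, b: Long, flag: Boolean): Long = {
    subExprComputed = false // the cached value lives for one row only
    val inBranch = if (flag) subExpr(a, b) + 1 else 0L
    inBranch + subExpr(a, b) // reuses the cached value when the branch already ran
  }
}
```

The subexpression is computed at most once per row whether it is first reached inside the branch or after it, which addresses the "evaluate it again?" question raised earlier in the thread. The price is an extra flag check on every access.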
@cloud-fan Looks good for now. I will take a look and give it a try tomorrow. |
Since the subexpression elimination problem may be hard to fix, I only add the regression test in this PR; we can remove the |
@cloud-fan OK. I am looking at that issue. |
Test build #68436 has finished for PR 15807 at commit
## What changes were proposed in this pull request?

~In `TypedAggregateExpression.evaluateExpression`, we may create `ReferenceToExpressions` with `CreateStruct`, and `CreateStruct` may generate too much code and split it into several methods. `ReferenceToExpressions` will replace `BoundReference` in `CreateStruct` with `LambdaVariable`, which can only be used as a local variable and doesn't work if we split the generated code.~

It's already fixed by #15693; this PR adds a regression test.

## How was this patch tested?

New test in `DatasetAggregatorSuite`.

Author: Wenchen Fan

Closes #15807 from cloud-fan/typed-agg.

(cherry picked from commit 6021c95)
Signed-off-by: Wenchen Fan
Thanks for the review, merging to master/2.1.
…tional expressions

## What changes were proposed in this pull request?

As I pointed out in #15807 (comment), the current subexpression elimination framework has a problem: it always evaluates all common subexpressions at the beginning, even if they are inside conditional expressions and may not be accessed.

Ideally we should implement it like a Scala lazy val, so that a subexpression is evaluated only when it gets accessed at least once. #15837 tries this approach, but it seems too complicated and may introduce a performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.

## How was this patch tested?

Regression test.

Author: Wenchen Fan

Closes #16659 from cloud-fan/codegen.
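The idea of stopping elimination at conditional expressions can be sketched on a toy expression tree. This is an assumption-laden illustration, not the code from #16659: the `Expr` hierarchy and `commonSubexpressions` are hypothetical, and the sketch skips a conditional together with everything beneath it.

```scala
object SkipConditionalCseSketch {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class If(predicate: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr

  // Collects the non-leaf expressions that are always evaluated.
  // A conditional and everything under it are skipped, since its branches may never run.
  private def alwaysEvaluated(e: Expr): List[Expr] = e match {
    case Col(_)        => Nil
    case If(_, _, _)   => Nil
    case a @ Add(l, r) => a :: (alwaysEvaluated(l) ++ alwaysEvaluated(r))
  }

  // Returns the expressions that occur more than once among the always-evaluated ones.
  def commonSubexpressions(roots: Seq[Expr]): Set[Expr] =
    roots.flatMap(alwaysEvaluated).groupBy(identity).collect {
      case (expr, occurrences) if occurrences.size > 1 => expr
    }.toSet
}
```

For example, if `a + b` appears once at the top level and once inside an `If` branch, it is not reported as common and therefore is not hoisted. If it appears twice outside any conditional, it is reported.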
## What changes were proposed in this pull request?

~In `TypedAggregateExpression.evaluateExpression`, we may create `ReferenceToExpressions` with `CreateStruct`, and `CreateStruct` may generate too much code and split it into several methods. `ReferenceToExpressions` will replace `BoundReference` in `CreateStruct` with `LambdaVariable`, which can only be used as a local variable and doesn't work if we split the generated code.~

It's already fixed by #15693; this PR adds a regression test.

## How was this patch tested?

New test in `DatasetAggregatorSuite`.