[SPARK-22668][SQL] Ensure no global variables in arguments of method split by CodegenContext.splitExpressions() #20021
Does this PR only check whether the condition of approach 2 holds? If approach 2 does not replace a global variable with a temporary one, the assertion may fire. Am I wrong?
Yea, I actually manually checked all the caller side of I also reverted some changes that tried to localize global variables as they are not needed now. |
Test build #85114 has finished for PR 20021 at commit
I see. I agree that to add assert helps all of developers. My question is what is a rule to decide whether a mutable state should be localized or not? I think that we have to ensure caller of |
if (Utils.isTesting) {
  // Passing global variables to the split method is dangerous, as any mutating to it is
  // ignored and may lead to unexpected behavior.
  val mutableStateNames = mutableStates.map(_._2).toSet
After finally merging SPARK-18016, this should be the union of arrayCompactedMutableStates.flatMap(_.arrayNames) and inlinedMutableStates.map(_._2), I think.
…text.splitExpressions()
There is no rule as we don't need to localize global variables anymore.
if (Utils.isTesting) {
  // Passing global variables to the split method is dangerous, as any mutating to it is
  // ignored and may lead to unexpected behavior.
  // We don't need to check `arrayCompactedMutableStates` here, as it results to array access
What if we declare a variable with the same name as one of the arrayCompactedMutableStates? Let's say that we have:
public class Foo {
  private Object[] ourArray;
  // ....
  private void ourMethod() {
    Object[] ourArray = new Object[1];
    ourSplitFunction(ourArray);
  }
  private void ourSplitFunction(Object[] ourArray) {
    ourArray[0] = null;
  }
  // ...
}
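A runnable variant of the example above may make the concern concrete. This is only a sketch: the class name ArrayShadowDemo, the initial array contents, and the two run... methods are assumptions added for illustration, not generated Spark code.

```java
public class ArrayShadowDemo {
    // Stands in for an array-compacted mutable state of the generated class.
    private Object[] ourArray = new Object[] {"state"};

    // The parameter shadows the field, so the write goes to whatever array
    // the caller passed in.
    private void ourSplitFunction(Object[] ourArray) {
        ourArray[0] = null;
    }

    // A local array with the same name is passed: the field is left untouched.
    public Object runWithLocalArray() {
        Object[] ourArray = new Object[] {"local"};
        ourSplitFunction(ourArray);
        return this.ourArray[0];
    }

    // The field itself is passed: the write reaches the field's array.
    public Object runWithField() {
        ourSplitFunction(this.ourArray);
        return this.ourArray[0];
    }

    public static void main(String[] args) {
        System.out.println(new ArrayShadowDemo().runWithLocalArray()); // prints "state"
        System.out.println(new ArrayShadowDemo().runWithField());      // prints "null"
    }
}
```

In the first call the split function silently updates the local array, which is the corner case a name check on the array states would catch.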
Is there any place we would do that? ctx.addMutableState returns array access code; I can't imagine a caller would extract the array name from it and use it as a parameter...
I don't think so, but currently there is also no place that creates the problem for which this assertion is being introduced. Of course this case is very unlikely, but since we are introducing the check anyway, I think the effort to also cover this remote corner case is very low...
Here is a fix that avoids to pass a global variable to Would it be possible to share finding from your manual check at all the caller side of |
Test build #85112 has finished for PR 20021 at commit
Test build #85120 has finished for PR 20021 at commit
Test build #85121 has finished for PR 20021 at commit
@@ -930,6 +930,18 @@ class CodegenContext {
      // inline execution if only one block
      blocks.head
    } else {
      if (Utils.isTesting) {
Do we only assert in testing? Passing global variables does not raise a compile error, so if any global variables are passed in outside of testing, codegen still works and may produce wrong results.
As you said, it may lead to a wrong result, but most likely it doesn't. So I do think the best option is to assert only in testing, where it can help find potential bugs. In production, throwing an exception is overkill for a situation that most likely is not a problem, IMHO.
Here is a discussion about this testing.
You can also do this check. Look at the |
I checked some call sites. Here is one example that |
Test build #85150 has finished for PR 20021 at commit
Here is another example. This is complicated.
As this example points out, while WDYT?
Jenkins, retest this please
Test build #85164 has finished for PR 20021 at commit
Hey, |
Oh, you are right. I misunderstood. After our optimizations, output is also a part of |
Honestly, I much preferred running the check only in tests and not throwing an exception in production. IMHO throwing an exception in production is overkill. In the remote case that we forget a place where this check throws even though there is no real issue, which is perfectly possible, it would also cause a regression. So I am strongly against this solution.
Well, I proposed to check it only for tests at the beginning, but I don't have a strong preference now as the new approach I took can guarantee that no place would violate it, by looking at all the caller sides of Anyway only checking it in tests is safer, WDYT @viirya ? |
Ok. I'm fine with only checking it in tests.
Test build #85240 has finished for PR 20021 at commit
retest this please.
LGTM
LGTM too, thanks!
Test build #85245 has finished for PR 20021 at commit
retest this please
Test build #85262 has finished for PR 20021 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Passing global variables to the split method is dangerous: the parameter shadows the global variable inside the method, so any mutation of it is lost and may lead to unexpected behavior.
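The problem can be reproduced in plain Java, the language of the generated code. The following is a minimal sketch; the class GlobalArgDemo and its method names are hypothetical and only mimic the shape of a generated class with one mutable state and one split method.

```java
public class GlobalArgDemo {
    // Stands in for a mutable state ("global variable") of the generated class.
    private int value = 1;

    // Hypothetical split method whose parameter has the same name as the field.
    // Java passes the int by value and the parameter shadows the field, so this
    // assignment only changes the parameter.
    private void splitShadowed(int value) {
        value = 42;
    }

    // Same update without a shadowing parameter: the field itself is changed.
    private void splitDirect() {
        value = 42;
    }

    public int afterShadowed() {
        splitShadowed(value);
        return value;
    }

    public int afterDirect() {
        splitDirect();
        return value;
    }

    public static void main(String[] args) {
        System.out.println(new GlobalArgDemo().afterShadowed()); // prints 1
        System.out.println(new GlobalArgDemo().afterDirect());   // prints 42
    }
}
```

Both variants compile without warnings, which is why the mistake is easy to miss in generated code.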
To prevent this, one approach is to make sure no expression would output global variables: Localizing lifetime of mutable states in expressions.
Another approach is, when calling ctx.splitExpressions, to make sure we don't use children's output as parameter names.
Approach 1 is actually hard to do, as we need to check all expressions and operators that support whole-stage codegen. Approach 2 is easier, as ctx.splitExpressions does not have many callers.
Besides, approach 2 is more flexible, as children's output may be something else that can't be a parameter name: a literal, an inlined statement (a + 1), etc.
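The idea behind the check can be sketched as a comparison between the parameter names of the split method and the names of the known mutable states. The patch itself does this in Scala inside CodegenContext, guarded by Utils.isTesting; the Java version below is only an illustration, and SplitArgumentCheck, findGlobalArguments, requireNoGlobalArguments, and the sample names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitArgumentCheck {
    // Returns the argument names that collide with a mutable state name.
    public static List<String> findGlobalArguments(
            Set<String> mutableStateNames, List<String> argumentNames) {
        List<String> offending = new ArrayList<>();
        for (String name : argumentNames) {
            if (mutableStateNames.contains(name)) {
                offending.add(name);
            }
        }
        return offending;
    }

    // Fails fast when a global variable would be passed to a split method.
    public static void requireNoGlobalArguments(
            Set<String> mutableStateNames, List<String> argumentNames) {
        List<String> offending = findGlobalArguments(mutableStateNames, argumentNames);
        if (!offending.isEmpty()) {
            throw new IllegalStateException(
                "split method arguments should not be global variables: " + offending);
        }
    }

    public static void main(String[] args) {
        Set<String> states = new HashSet<>(Arrays.asList("globalIsNull", "globalValue"));
        // Passes: "row" and "i" are not mutable states.
        requireNoGlobalArguments(states, Arrays.asList("row", "i"));
        // Reports the colliding name.
        System.out.println(findGlobalArguments(states, Arrays.asList("row", "globalValue")));
    }
}
```

Returning the offending names separately from the failing check keeps the error message informative and makes the comparison easy to test on its own.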
close #19865
close #19938
How was this patch tested?
existing tests