-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection #20637
Conversation
Test build #87544 has finished for PR 20637 at commit
|
retest this please |
Test build #87548 has finished for PR 20637 at commit
|
retest this please |
Test build #87549 has finished for PR 20637 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left a random small comment. It seems reasonable though I don't know this code well.
val numVarLenFields = exprTypes.count { | ||
case dt if UnsafeRow.isFixedLength(dt) => false | ||
val numVarLenFields = exprTypeAndNullables.count { | ||
case (dt, _) if UnsafeRow.isFixedLength(dt) => false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is .count { case (dt, _) => !UnsafeRow.isFixedLength(dt) }
more straightforward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just one comment, otherwise seems ok to me
@@ -142,7 +143,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro | |||
case _ => s"$rowWriter.write($index, ${input.value});" | |||
} | |||
|
|||
if (input.isNull == "false") { | |||
if (input.isNull == "false" || !nullable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't those checks equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, thanks
retest this please |
@@ -70,7 +72,8 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro | |||
| // Remember the current cursor so that we can calculate how many bytes are | |||
| // written later. | |||
| final int $previousCursor = $rowWriter.cursor(); | |||
| ${writeExpressionsToBuffer(ctx, tmpInput, fieldEvals, fieldTypes, structRowWriter)} | |||
| ${writeExpressionsToBuffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in situations like this I have been told to move the function call before and assign it to a variable
val elementAssignment = if (elementNullable) { | ||
s""" | ||
|if ($tmpInput.isNullAt($index)) { | ||
| $arrayWriter.setNull$primitiveTypeName($index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not using .setNull${elementOrOffsetSize}Bytes
as it was before?
@@ -219,15 +235,17 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro | |||
| // Remember the current cursor so that we can write numBytes of key array later. | |||
| final int $tmpCursor = $rowWriter.cursor(); | |||
| | |||
| ${writeArrayToBuffer(ctx, s"$tmpInput.keyArray()", keyType, rowWriter)} | |||
| ${writeArrayToBuffer( | |||
ctx, s"$tmpInput.keyArray()", keyType, false, rowWriter)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be on one line right?
Test build #94445 has finished for PR 20637 at commit
|
Test build #94464 has finished for PR 20637 at commit
|
The failure of This inconsistency is agreed in the discussion at SPARK-23173. When WDYT cc: @gatorsmile @cloud-fan @HyukjinKwon |
Test build #94539 has finished for PR 20637 at commit
|
private def writeStructToBuffer( | ||
ctx: CodegenContext, | ||
input: String, | ||
index: String, | ||
fieldTypes: Seq[DataType], | ||
fieldTypeAndNullables: Seq[(DataType, Boolean)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we create a class for (DataType, Boolean)
? it can also be used in #22063
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it would be good since it is used at JavaTypeInference
and higherOrderFunctions
.
cc @ueshin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan What name of the case class do you suggest? DataTypeNullable
, or others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I found this case class case class Schema(dataType: DataType, nullable: Boolean)
at two places.
ScalaReflection.Schema
SchemaConverters.SchemaType
Do we use one of them? Or do we define org.apache.spark.sql.types.Schema
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can define one within this class for readability. Probably we should take a look to deduplicate them for the instances you pointed out, but not sure yet for now if how much that deduplication improve the readability and if that's worth. Those are all look rather defined and used within small scopes rather locally.
@@ -170,6 +174,23 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro | |||
|
|||
val element = CodeGenerator.getValue(tmpInput, et, index) | |||
|
|||
val primitiveTypeName = if (CodeGenerator.isPrimitiveType(jt)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
+1. |
+1 for ^ |
Test build #94806 has finished for PR 20637 at commit
|
Jenkins, retest this please. |
Test build #94878 has finished for PR 20637 at commit
|
Jenkins, retest this please. |
Test build #94881 has finished for PR 20637 at commit
|
Test build #94896 has finished for PR 20637 at commit
|
Test build #94899 has finished for PR 20637 at commit
|
retest this please |
Test build #94904 has finished for PR 20637 at commit
|
@@ -223,8 +223,9 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks with PlanTestBa | |||
} | |||
} else { | |||
val lit = InternalRow(expected, expected) | |||
val dtAsNullable = expression.dataType.asNullable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't go through the entire thread. But my opinion is that, the data type nullable should match the real data.
BTW will this reduce test coverage? It seems the optimization for non-nullable fields is not tested if we always assume the expression is nullable.
we use the default value for the type, and create a wrong result instead of throwing a NPE.
This is expected and I think it's a common symptom for nullable-mismatch problems. Why can't our test expose it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin @cloud-fan Thank you for good summary.
I think that this does not reduce test coverage.
This dtAsNullable = expression.dataType.asNullable
is used only for generating expected
. This asNullable
does not change dataType
of expression
. Thus, this does not change our optimization assumption.
@@ -35,6 +35,24 @@ class ExpressionEvalHelperSuite extends SparkFunSuite with ExpressionEvalHelper | |||
val e = intercept[RuntimeException] { checkEvaluation(BadCodegenExpression(), 10) } | |||
assert(e.getMessage.contains("some_variable")) | |||
} | |||
|
|||
test("SPARK-23466: checkEvaluationWithUnsafeProjection should fail if null is compared with " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this test doing? Let the test framework discover wrongly written test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan This test tries to confirm whether Array(null, -1, 0, 1)
with ArrayType(IntegerType, false)
in expression2
should fail when it is compared with expected = Array(null, -1, 0, 1)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary? It looks to me that this is a malformed test case: you are putting a wrong expected value.
It's good to improve the test framework to detect this kind of wrongly written tests, but I don't think we have to do it now.
Another topic is, when the nullability doesn't match the data, how Spark should detect it. This is a different story and we can investigate later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Let me drop this test case.
Test build #95479 has finished for PR 20637 at commit
|
with the test removed, do we still need this change? https://github.com/apache/spark/pull/20637/files#diff-41747ec3f56901eb7bfb95d2a217e94dR226 |
I believe we still need this change to correctly detect incorrect results in the future. |
can you give a concrete example of how this can detect incorrect results? The removed test case only shows how we can detect a wrongly written test. |
We need to detect the correctly written test with a wrong result. Let us think about the following In the following example, As described in the comment,
Since #22126 has been merged, the test is passed. If someone would unintentionally generate incorrect dataType, we must detect the mistake by failing the test without an exception. To avoid the exception, we need Is this an answer to your question?
|
This is the part I don't agree with. If someone writes a bad UT, it is good to get an exception. It is also an hint that the problem is in the UT itself, rather than in the code. But we already discussed this and we have different opinion on this. :) |
Note that if we miss non-primitive type tests unfortunately, we can't detect the bad UTs without |
Now the topic becomes detecting bad UTs, instead of "Remove redundant null checks". Can we focus on "Remove redundant null checks" in this PR and send another PR for detecting bad UTs? |
BTW there are so many kinds of bad UTs, we should clearly define the scope when fixing it. |
LGTM |
I see. Let me remove the change regarding |
Test build #95550 has finished for PR 20637 at commit
|
retest this please |
Test build #95560 has finished for PR 20637 at commit
|
Thanks! merging to master. |
What changes were proposed in this pull request?
This PR works for one of TODOs in
GenerateUnsafeProjection
"if the nullability of field is correct, we can use it to save null check" to simplify generated code.When
nullable=false
inDataType
,GenerateUnsafeProjection
removed code for null checks in the generated Java code.How was this patch tested?
Added new test cases into
GenerateUnsafeProjectionSuite