
[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection #20637

Closed · wants to merge 14 commits

kiszk
Member

@kiszk kiszk commented Feb 19, 2018

What changes were proposed in this pull request?

This PR addresses one of the TODOs in GenerateUnsafeProjection, "if the nullability of field is correct, we can use it to save null check", to simplify the generated code.
When nullable=false in the DataType, GenerateUnsafeProjection now omits the null checks in the generated Java code.
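
A minimal sketch of the codegen pattern (not the actual GenerateUnsafeProjection source; writeFieldCode and the rowWriter calls stand in for the real UnsafeRowWriter plumbing):

def writeFieldCode(index: Int, value: String, isNull: String, nullable: Boolean): String = {
  val write = s"rowWriter.write($index, $value);"
  if (!nullable) {
    write // nullable = false: the null branch is dropped from the generated Java code
  } else {
    s"""
       |if ($isNull) {
       |  rowWriter.setNullAt($index);
       |} else {
       |  $write
       |}
     """.stripMargin
  }
}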

How was this patch tested?

Added new test cases to GenerateUnsafeProjectionSuite.

@SparkQA

SparkQA commented Feb 19, 2018

Test build #87544 has finished for PR 20637 at commit e2e9e36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title Remove redundant null checks in generated Java code by GenerateUnsafeProjection [SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection Feb 20, 2018
@kiszk
Member Author

kiszk commented Feb 20, 2018

retest this please

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87548 has finished for PR 20637 at commit e2e9e36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Feb 20, 2018

retest this please

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87549 has finished for PR 20637 at commit e2e9e36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Just left a random small comment. It seems reasonable, though I don't know this code well.

-    val numVarLenFields = exprTypes.count {
-      case dt if UnsafeRow.isFixedLength(dt) => false
+    val numVarLenFields = exprTypeAndNullables.count {
+      case (dt, _) if UnsafeRow.isFixedLength(dt) => false
Member

is .count { case (dt, _) => !UnsafeRow.isFixedLength(dt) } more straightforward?

Member Author

sure
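
(For reference, the two spellings side by side; this is a style-only difference, and the second case of the guard version is presumed from the truncated diff above:)

// Guard-style: fixed-length types are filtered out explicitly.
val n1 = exprTypeAndNullables.count {
  case (dt, _) if UnsafeRow.isFixedLength(dt) => false
  case _ => true
}

// Direct predicate, as suggested: count the variable-length types.
val n2 = exprTypeAndNullables.count { case (dt, _) => !UnsafeRow.isFixedLength(dt) }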

Contributor

@mgaido91 mgaido91 left a comment

just one comment, otherwise seems ok to me

@@ -142,7 +143,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
       case _ => s"$rowWriter.write($index, ${input.value});"
     }

-    if (input.isNull == "false") {
+    if (input.isNull == "false" || !nullable) {
Contributor

aren't those checks equivalent?

Member Author

good catch, thanks

@kiszk
Member Author

kiszk commented Aug 8, 2018

retest this please

@@ -70,7 +72,8 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
          | // Remember the current cursor so that we can calculate how many bytes are
          | // written later.
          | final int $previousCursor = $rowWriter.cursor();
-         | ${writeExpressionsToBuffer(ctx, tmpInput, fieldEvals, fieldTypes, structRowWriter)}
+         | ${writeExpressionsToBuffer(
Contributor

nit: in situations like this I have been told to move the function call out and assign its result to a variable first

    val elementAssignment = if (elementNullable) {
      s"""
         |if ($tmpInput.isNullAt($index)) {
         |  $arrayWriter.setNull$primitiveTypeName($index);
Contributor

why not use .setNull${elementOrOffsetSize}Bytes as it was before?

@@ -219,15 +235,17 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
          | // Remember the current cursor so that we can write numBytes of key array later.
          | final int $tmpCursor = $rowWriter.cursor();
          |
-         | ${writeArrayToBuffer(ctx, s"$tmpInput.keyArray()", keyType, rowWriter)}
+         | ${writeArrayToBuffer(
+             ctx, s"$tmpInput.keyArray()", keyType, false, rowWriter)}
Contributor

this can be on one line, right?

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94445 has finished for PR 20637 at commit 2d5c2eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94464 has finished for PR 20637 at commit 3ec2b19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 10, 2018

The failure of org.apache.spark.sql.catalyst.expressions.JsonExpressionsSuite "from_json missing fields" is due to passing null while the schema has nullable=false.

This inconsistency was agreed upon in the discussion at SPARK-23173:
assume that each field in the schema passed to from_json is nullable, and ignore the nullability information set in the passed schema.

When spark.sql.fromJsonForceNullableSchema=false, I think it is invalid for a test to pass nullable=false in the schema for the missing field.
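
(For context, a minimal sketch of the pattern the test exercises; the schema and input here are illustrative, not the suite's exact values:)

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import spark.implicits._  // assumes `spark` is an active SparkSession

// The schema declares field "a" as non-nullable, but the JSON input is
// missing "a", so from_json has to emit null for it anyway; this is the
// inconsistency that spark.sql.fromJsonForceNullableSchema works around.
val schema = StructType(Seq(StructField("a", IntegerType, nullable = false)))
val parsed = Seq("""{"b": 1}""").toDF("json")
  .select(from_json(col("json"), schema).as("parsed"))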

WDYT cc: @gatorsmile @cloud-fan @HyukjinKwon

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94539 has finished for PR 20637 at commit 0f7ae11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  private def writeStructToBuffer(
      ctx: CodegenContext,
      input: String,
      index: String,
-     fieldTypes: Seq[DataType],
+     fieldTypeAndNullables: Seq[(DataType, Boolean)],
Contributor

shall we create a class for (DataType, Boolean)? it can also be used in #22063

Member Author

I think that would be good, since it is used in JavaTypeInference and higherOrderFunctions.
cc @ueshin

Member Author

@cloud-fan What name do you suggest for the case class? DataTypeNullable, or something else?

Member Author

@cloud-fan I found this case class, case class Schema(dataType: DataType, nullable: Boolean), in two places:

  1. ScalaReflection.Schema
  2. SchemaConverters.SchemaType

Do we use one of them? Or do we define org.apache.spark.sql.types.Schema?

Member

I think you can define one within this class for readability. Probably we should take a look at deduplicating the instances you pointed out, but I'm not sure yet how much that deduplication would improve readability, and whether it's worth it. Those all look like they are defined and used within small, local scopes.
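
(For reference, the shape the helper ended up taking per the SparkQA output further below, mirroring the existing ScalaReflection.Schema:)

import org.apache.spark.sql.types.DataType

// Pairs a DataType with its nullability, replacing the raw
// (DataType, Boolean) tuples threaded through the codegen helpers.
case class Schema(dataType: DataType, nullable: Boolean)

With it, fieldTypeAndNullables: Seq[(DataType, Boolean)] presumably becomes Seq[Schema], and matches read case Schema(dt, nullable) => ....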

@@ -170,6 +174,23 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]

     val element = CodeGenerator.getValue(tmpInput, et, index)

+    val primitiveTypeName = if (CodeGenerator.isPrimitiveType(jt)) {
Contributor

where do we use it?

Member Author

good catch

@cloud-fan
Contributor

> When spark.sql.fromJsonForceNullableSchema=false, I think it is invalid for a test to pass nullable=false in the schema for the missing field.

+1.

@HyukjinKwon
Member

+1 for ^

@SparkQA

SparkQA commented Aug 15, 2018

Test build #94806 has finished for PR 20637 at commit 99731ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Schema(dataType: DataType, nullable: Boolean)

@ueshin
Member

ueshin commented Aug 17, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94878 has finished for PR 20637 at commit 99731ca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Schema(dataType: DataType, nullable: Boolean)

@ueshin
Member

ueshin commented Aug 17, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94881 has finished for PR 20637 at commit 99731ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Schema(dataType: DataType, nullable: Boolean)

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94896 has finished for PR 20637 at commit 675409e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94899 has finished for PR 20637 at commit 84961b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 17, 2018

retest this please

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94904 has finished for PR 20637 at commit 84961b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 18, 2018

cc @ueshin @cloud-fan

@@ -223,8 +223,9 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks with PlanTestBase
       }
     } else {
       val lit = InternalRow(expected, expected)
+      val dtAsNullable = expression.dataType.asNullable
Contributor

@cloud-fan cloud-fan Aug 27, 2018

I didn't go through the entire thread, but my opinion is that the data type's nullability should match the real data.

BTW, will this reduce test coverage? It seems the optimization for non-nullable fields is not tested if we always assume the expression is nullable.

> we use the default value for the type, and create a wrong result instead of throwing a NPE.

This is expected, and I think it's a common symptom of nullability-mismatch problems. Why can't our test expose it?

Member Author

@ueshin @cloud-fan Thank you for the good summary.

I think this does not reduce test coverage.
dtAsNullable = expression.dataType.asNullable is used only for generating expected. asNullable does not change the dataType of expression, so this does not change our optimization assumption.
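
(A quick illustration of what asNullable does; note that DataType.asNullable is private[spark], so this sketch assumes code living inside the Spark source tree:)

import org.apache.spark.sql.types._

val dt = ArrayType(IntegerType, containsNull = false)

// asNullable recursively marks the type as nullable (array elements, map
// values, struct fields); the expression that produced it is untouched.
val nullableDt = dt.asNullable  // ArrayType(IntegerType, containsNull = true)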

@@ -35,6 +35,24 @@ class ExpressionEvalHelperSuite extends SparkFunSuite with ExpressionEvalHelper
     val e = intercept[RuntimeException] { checkEvaluation(BadCodegenExpression(), 10) }
     assert(e.getMessage.contains("some_variable"))
   }

+  test("SPARK-23466: checkEvaluationWithUnsafeProjection should fail if null is compared with " +
Contributor

what is this test doing? Let the test framework discover wrongly written test cases?

Member Author

@cloud-fan This test tries to confirm that Array(null, -1, 0, 1) with ArrayType(IntegerType, false) in expression2 fails when it is compared with expected = Array(null, -1, 0, 1).

Contributor

is this necessary? It looks to me like a malformed test case: you are putting in a wrong expected value.

It's good to improve the test framework to detect this kind of wrongly written test, but I don't think we have to do it now.

Another topic is how Spark should detect it when the nullability doesn't match the data. That is a different story and we can investigate it later.

Member Author

I see. Let me drop this test case.

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95479 has finished for PR 20637 at commit 37dc4d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

with the test removed, do we still need this change? https://github.com/apache/spark/pull/20637/files#diff-41747ec3f56901eb7bfb95d2a217e94dR226

@kiszk
Member Author

kiszk commented Aug 31, 2018

I believe we still need this change to correctly detect incorrect results in the future.

@cloud-fan
Contributor

can you give a concrete example of how this can detect incorrect results? The removed test case only shows how we can detect a wrongly written test.

@kiszk
Member Author

kiszk commented Aug 31, 2018

We need to detect a correctly written test that produces a wrong result. Let us think about the following map_zip_with example without #22126.

In the example below, map_zip_with without #22126 returns an expr that has Map(1 -> -10, 4 -> null) with MapType(IntegerType, IntegerType, valueContainsNull = false). The DataType is not correct since the Map(...) includes null. Thus, the test must fail.

As described in the comment,

  • With asNullable: the test fails with an incorrect evaluation result (it fails in the expected way)
  • Without asNullable: the test fails with a NullPointerException (it fails in an unexpected way)

Since #22126 has been merged, the test passes. If someone unintentionally generates an incorrect dataType, we must detect the mistake by failing the test without an exception. To avoid the exception, we need asNullable.

Is this an answer to your question?

val miia = Literal.create(Map(1 -> 10),
  MapType(IntegerType, IntegerType, valueContainsNull = false))
val miib = Literal.create(Map(1 -> -1, 4 -> -4),
  MapType(IntegerType, IntegerType, valueContainsNull = false))

val expr = map_zip_with(miia, miib, multiplyKeyWithValues)
checkEvaluation(expr, Map(1 -> -10, 4 -> null))

@mgaido91
Contributor

> If someone unintentionally generates an incorrect dataType, we must detect the mistake by failing the test without an exception.

This is the part I don't agree with. If someone writes a bad UT, it is good to get an exception. It is also a hint that the problem is in the UT itself, rather than in the code. But we already discussed this and we have different opinions on it. :)

@ueshin
Member

ueshin commented Aug 31, 2018

Note that if we unfortunately miss non-primitive-type tests, we can't detect the bad UTs without asNullable, because they won't throw any exceptions but will just use the default value.

@cloud-fan
Contributor

Now the topic becomes detecting bad UTs, instead of "Remove redundant null checks".

Can we focus on "Remove redundant null checks" in this PR and send another PR for detecting bad UTs?

@cloud-fan
Contributor

BTW there are so many kinds of bad UTs, we should clearly define the scope when fixing it.

@mgaido91
Contributor

LGTM

@kiszk
Member Author

kiszk commented Aug 31, 2018

I see. Let me remove the change regarding asNullable from this PR. I will create another PR after this PR is merged.

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95550 has finished for PR 20637 at commit 88c74c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 31, 2018

retest this please

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95560 has finished for PR 20637 at commit 88c74c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Sep 1, 2018

Thanks! merging to master.

@asfgit asfgit closed this in c5583fd Sep 1, 2018