
[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection #20637

Closed · wants to merge 14 commits

kiszk
Member

@kiszk kiszk commented Feb 19, 2018

What changes were proposed in this pull request?

This PR addresses one of the TODOs in GenerateUnsafeProjection, "if the nullability of field is correct, we can use it to save null check", to simplify the generated code.
When nullable=false in the DataType, GenerateUnsafeProjection now omits the null checks in the generated Java code.
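
A minimal sketch of the codegen pattern (not the actual GenerateUnsafeProjection source; writeFieldCode and the rowWriter calls stand in for the real UnsafeRowWriter plumbing):

def writeFieldCode(index: Int, value: String, isNull: String, nullable: Boolean): String = {
  val write = s"rowWriter.write($index, $value);"
  if (!nullable) {
    write // nullable = false: the null branch is dropped from the generated Java code
  } else {
    s"""
       |if ($isNull) {
       |  rowWriter.setNullAt($index);
       |} else {
       |  $write
       |}
     """.stripMargin
  }
}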

How was this patch tested?

Added new test cases to GenerateUnsafeProjectionSuite.

@SparkQA

SparkQA commented Feb 19, 2018

Test build #87544 has finished for PR 20637 at commit e2e9e36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title Remove redundant null checks in generated Java code by GenerateUnsafeProjection [SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection Feb 20, 2018
@kiszk
Member Author

kiszk commented Feb 20, 2018

retest this please

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87548 has finished for PR 20637 at commit e2e9e36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Feb 20, 2018

retest this please

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87549 has finished for PR 20637 at commit e2e9e36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Just left a random small comment. It seems reasonable, though I don't know this code well.

-    val numVarLenFields = exprTypes.count {
-      case dt if UnsafeRow.isFixedLength(dt) => false
+    val numVarLenFields = exprTypeAndNullables.count {
+      case (dt, _) if UnsafeRow.isFixedLength(dt) => false
Member

is .count { case (dt, _) => !UnsafeRow.isFixedLength(dt) } more straightforward?

Member Author

sure
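
(For reference, the two spellings side by side; this is a style-only difference, and the second case of the guard version is presumed from the truncated diff above:)

// Guard-style: fixed-length types are filtered out explicitly.
val n1 = exprTypeAndNullables.count {
  case (dt, _) if UnsafeRow.isFixedLength(dt) => false
  case _ => true
}

// Direct predicate, as suggested: count the variable-length types.
val n2 = exprTypeAndNullables.count { case (dt, _) => !UnsafeRow.isFixedLength(dt) }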

Contributor

@mgaido91 mgaido91 left a comment

just one comment, otherwise seems ok to me

@@ -142,7 +143,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
       case _ => s"$rowWriter.write($index, ${input.value});"
     }

-    if (input.isNull == "false") {
+    if (input.isNull == "false" || !nullable) {
Contributor

aren't those checks equivalent?

Member Author

good catch, thanks

@kiszk
Member Author

kiszk commented Aug 8, 2018

retest this please

@@ -70,7 +72,8 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
          | // Remember the current cursor so that we can calculate how many bytes are
          | // written later.
          | final int $previousCursor = $rowWriter.cursor();
-         | ${writeExpressionsToBuffer(ctx, tmpInput, fieldEvals, fieldTypes, structRowWriter)}
+         | ${writeExpressionsToBuffer(
Contributor

nit: in situations like this I have been told to move the function call out and assign its result to a variable first

    val elementAssignment = if (elementNullable) {
      s"""
         |if ($tmpInput.isNullAt($index)) {
         |  $arrayWriter.setNull$primitiveTypeName($index);
Contributor

why not use .setNull${elementOrOffsetSize}Bytes as it was before?

@@ -219,15 +235,17 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
          | // Remember the current cursor so that we can write numBytes of key array later.
          | final int $tmpCursor = $rowWriter.cursor();
          |
-         | ${writeArrayToBuffer(ctx, s"$tmpInput.keyArray()", keyType, rowWriter)}
+         | ${writeArrayToBuffer(
+             ctx, s"$tmpInput.keyArray()", keyType, false, rowWriter)}
Contributor

this can be on one line, right?

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94445 has finished for PR 20637 at commit 2d5c2eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94464 has finished for PR 20637 at commit 3ec2b19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 10, 2018

The failure of org.apache.spark.sql.catalyst.expressions.JsonExpressionsSuite "from_json missing fields" is due to passing null while the schema has nullable=false.

This inconsistency was agreed upon in the discussion at SPARK-23173:
assume that each field in the schema passed to from_json is nullable, and ignore the nullability information set in the passed schema.

When spark.sql.fromJsonForceNullableSchema=false, I think it is invalid for a test to pass nullable=false in the schema for the missing field.
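
(For context, a minimal sketch of the pattern the test exercises; the schema and input here are illustrative, not the suite's exact values:)

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import spark.implicits._  // assumes `spark` is an active SparkSession

// The schema declares field "a" as non-nullable, but the JSON input is
// missing "a", so from_json has to emit null for it anyway; this is the
// inconsistency that spark.sql.fromJsonForceNullableSchema works around.
val schema = StructType(Seq(StructField("a", IntegerType, nullable = false)))
val parsed = Seq("""{"b": 1}""").toDF("json")
  .select(from_json(col("json"), schema).as("parsed"))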

WDYT cc: @gatorsmile @cloud-fan @HyukjinKwon

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94539 has finished for PR 20637 at commit 0f7ae11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  private def writeStructToBuffer(
      ctx: CodegenContext,
      input: String,
      index: String,
-     fieldTypes: Seq[DataType],
+     fieldTypeAndNullables: Seq[(DataType, Boolean)],
Contributor

shall we create a class for (DataType, Boolean)? it can also be used in #22063

Member Author

I think that would be good, since it is used in JavaTypeInference and higherOrderFunctions.
cc @ueshin

Member Author

@cloud-fan What name do you suggest for the case class? DataTypeNullable, or something else?

Member Author

@cloud-fan I found this case class, case class Schema(dataType: DataType, nullable: Boolean), in two places:

  1. ScalaReflection.Schema
  2. SchemaConverters.SchemaType

Do we use one of them? Or do we define org.apache.spark.sql.types.Schema?

Member

I think you can define one within this class for readability. Probably we should take a look at deduplicating the instances you pointed out, but I'm not sure yet how much that deduplication would improve readability, and whether it's worth it. Those all look like they are defined and used within small, local scopes.
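
(For reference, the shape the helper ended up taking per the SparkQA output further below, mirroring the existing ScalaReflection.Schema:)

import org.apache.spark.sql.types.DataType

// Pairs a DataType with its nullability, replacing the raw
// (DataType, Boolean) tuples threaded through the codegen helpers.
case class Schema(dataType: DataType, nullable: Boolean)

With it, fieldTypeAndNullables: Seq[(DataType, Boolean)] presumably becomes Seq[Schema], and matches read case Schema(dt, nullable) => ....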

@@ -170,6 +174,23 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]

     val element = CodeGenerator.getValue(tmpInput, et, index)

+    val primitiveTypeName = if (CodeGenerator.isPrimitiveType(jt)) {
Contributor

where do we use it?

Member Author

good catch

@cloud-fan
Contributor

> When spark.sql.fromJsonForceNullableSchema=false, I think it is invalid for a test to pass nullable=false in the schema for the missing field.

+1.

@HyukjinKwon
Member

+1 for ^

@SparkQA

SparkQA commented Aug 15, 2018

Test build #94806 has finished for PR 20637 at commit 99731ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Schema(dataType: DataType, nullable: Boolean)

@ueshin
Member

ueshin commented Aug 17, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94878 has finished for PR 20637 at commit 99731ca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Schema(dataType: DataType, nullable: Boolean)

@ueshin
Member

ueshin commented Aug 17, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94881 has finished for PR 20637 at commit 99731ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Schema(dataType: DataType, nullable: Boolean)

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94896 has finished for PR 20637 at commit 675409e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94899 has finished for PR 20637 at commit 84961b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 17, 2018

retest this please

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94904 has finished for PR 20637 at commit 84961b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 18, 2018

cc @ueshin @cloud-fan

@@ -223,8 +223,9 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks with PlanTestBase
       }
     } else {
       val lit = InternalRow(expected, expected)
+      val dtAsNullable = expression.dataType.asNullable
Contributor

@cloud-fan cloud-fan Aug 27, 2018

I didn't go through the entire thread, but my opinion is that the data type's nullability should match the real data.

BTW, will this reduce test coverage? It seems the optimization for non-nullable fields is not tested if we always assume the expression is nullable.

> we use the default value for the type, and create a wrong result instead of throwing a NPE.

This is expected, and I think it's a common symptom of nullability-mismatch problems. Why can't our test expose it?

Member Author

@ueshin @cloud-fan Thank you for the good summary.

I think this does not reduce test coverage.
dtAsNullable = expression.dataType.asNullable is used only for generating expected. asNullable does not change the dataType of expression, so this does not change our optimization assumption.
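
(A quick illustration of what asNullable does; note that DataType.asNullable is private[spark], so this sketch assumes code living inside the Spark source tree:)

import org.apache.spark.sql.types._

val dt = ArrayType(IntegerType, containsNull = false)

// asNullable recursively marks the type as nullable (array elements, map
// values, struct fields); the expression that produced it is untouched.
val nullableDt = dt.asNullable  // ArrayType(IntegerType, containsNull = true)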

@@ -35,6 +35,24 @@ class ExpressionEvalHelperSuite extends SparkFunSuite with ExpressionEvalHelper
     val e = intercept[RuntimeException] { checkEvaluation(BadCodegenExpression(), 10) }
     assert(e.getMessage.contains("some_variable"))
   }

+  test("SPARK-23466: checkEvaluationWithUnsafeProjection should fail if null is compared with " +
Contributor

what is this test doing? Let the test framework discover wrongly written test cases?

Member Author

@cloud-fan This test tries to confirm that Array(null, -1, 0, 1) with ArrayType(IntegerType, false) in expression2 fails when it is compared with expected = Array(null, -1, 0, 1).

Contributor

is this necessary? It looks to me like a malformed test case: you are putting in a wrong expected value.

It's good to improve the test framework to detect this kind of wrongly written test, but I don't think we have to do it now.

Another topic is how Spark should detect it when the nullability doesn't match the data. That is a different story and we can investigate it later.

Member Author

I see. Let me drop this test case.

@SparkQA

SparkQA commented Aug 30, 2018

Test build #95479 has finished for PR 20637 at commit 37dc4d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

with the test removed, do we still need this change? https://github.com/apache/spark/pull/20637/files#diff-41747ec3f56901eb7bfb95d2a217e94dR226

@kiszk
Member Author

kiszk commented Aug 31, 2018

I believe we still need this change to correctly detect incorrect results in the future.

@cloud-fan
Contributor

can you give a concrete example of how this can detect incorrect results? The removed test case only shows how we can detect a wrongly written test.

@kiszk
Member Author

kiszk commented Aug 31, 2018

We need to detect a correctly written test that produces a wrong result. Let us think about the following map_zip_with example without #22126.

In the example below, map_zip_with without #22126 returns an expr that has Map(1 -> -10, 4 -> null) with MapType(IntegerType, IntegerType, valueContainsNull = false). The DataType is not correct since the Map(...) includes null. Thus, the test must fail.

As described in the comment,

  • With asNullable: the test fails with an incorrect evaluation result (it fails in the expected way)
  • Without asNullable: the test fails with a NullPointerException (it fails in an unexpected way)

Since #22126 has been merged, the test passes. If someone unintentionally generates an incorrect dataType, we must detect the mistake by failing the test without an exception. To avoid the exception, we need asNullable.

Is this an answer to your question?

val miia = Literal.create(Map(1 -> 10),
  MapType(IntegerType, IntegerType, valueContainsNull = false))
val miib = Literal.create(Map(1 -> -1, 4 -> -4),
  MapType(IntegerType, IntegerType, valueContainsNull = false))

val expr = map_zip_with(miia, miib, multiplyKeyWithValues)
checkEvaluation(expr, Map(1 -> -10, 4 -> null))

@mgaido91
Contributor

> If someone unintentionally generates an incorrect dataType, we must detect the mistake by failing the test without an exception.

This is the part I don't agree with. If someone writes a bad UT, it is good to get an exception. It is also a hint that the problem is in the UT itself, rather than in the code. But we already discussed this and we have different opinions on it. :)

@ueshin
Member

ueshin commented Aug 31, 2018

Note that if we unfortunately miss non-primitive-type tests, we can't detect the bad UTs without asNullable, because they won't throw any exceptions but will just use the default value.

@cloud-fan
Contributor

Now the topic becomes detecting bad UTs, instead of "Remove redundant null checks".

Can we focus on "Remove redundant null checks" in this PR and send another PR for detecting bad UTs?

@cloud-fan
Contributor

BTW there are so many kinds of bad UTs, we should clearly define the scope when fixing it.

@mgaido91
Contributor

LGTM

@kiszk
Member Author

kiszk commented Aug 31, 2018

I see. Let me remove the change regarding asNullable from this PR. I will create another PR after this PR is merged.

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95550 has finished for PR 20637 at commit 88c74c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Aug 31, 2018

retest this please

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95560 has finished for PR 20637 at commit 88c74c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Sep 1, 2018

Thanks! merging to master.

@asfgit asfgit closed this in c5583fd Sep 1, 2018