
AnsiMode support for GetArrayItem GetMapValue and ElementAt for Spark 3.1.1 #2350

Merged
40 commits merged into NVIDIA:branch-0.6 on May 18, 2021

Conversation

@wjxiz1992 (Collaborator) commented May 6, 2021:

Fixes #2272 and #2276.
This PR adds shim support for GetArrayItem, GetMapValue and ElementAt to match the CPU behavior on Spark 3.1.1.

It relies on rapidsai/cudf#8209 and #2260.

More:
This adds a parameter all_null to ArrayGen in the data_gen part of integration_test. The parameter is used to create null arrays instead of empty arrays. A null array is used to create a corner case for GetArrayItem:

For a dataframe like:

+------------------------+
|col_1                   |
+------------------------+
|null                    |
|null                    |
+------------------------+

df.select(col("col_1")[2]).show() returns the following without throwing an exception:

+--------+
|col_1[2]|
+--------+
|    null|
|    null|
+--------+
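
To reproduce this corner case locally, here is a minimal sketch (assuming a local SparkSession; the DataFrame construction below is illustrative, not taken from the PR):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// Two rows that each hold a null array (not an empty one).
val schema = StructType(Seq(StructField("col_1", ArrayType(StringType), nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(null), Row(null))), schema)

// Indexing into a null array yields null rather than an error,
// even on Spark 3.1.1 with ANSI mode enabled.
df.select(col("col_1")(2)).show()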

@@ -17,13 +17,11 @@
package com.nvidia.spark.rapids.shims.spark311

import java.nio.ByteBuffer

Collaborator:

Scala style should be returning errors for these being removed.

Collaborator Author:

Updated.

 }

 /**
  * Returns the field at `ordinal` in the Array `child`.
  *
  * We need to do type checking here as the `ordinal` expression may be unresolved.
  */
-case class GpuGetArrayItem(child: Expression, ordinal: Expression)
+case class GpuGetArrayItem(child: Expression, ordinal: Expression, failOnError: Boolean = false)
Collaborator:

nit: I personally would prefer to not have a default value for failOnError, just so we are explicit about it everywhere.

Collaborator Author:

Done.

@@ -87,15 +87,15 @@ class GpuGetArrayItemMeta(
   override def convertToGpu(
       arr: Expression,
       ordinal: Expression): GpuExpression =
-    GpuGetArrayItem(arr, ordinal)
+    GpuGetArrayItem(arr, ordinal, SQLConf.get.ansiEnabled)
Collaborator:

This is wrong. In versions prior to 3.1.1 the default value should be false, not based off of the ansiEnabled config. Otherwise we will fail when Spark does not.

Collaborator Author:

Set to false and added a comment.

@revans2 (Collaborator) left a comment:

A few more bugs that I saw when I took a closer look:

-if (ordinal.isValid && ordinal.getInt >= 0) {
-  lhs.getBase.extractListElement(ordinal.getInt)
+if (ordinal.isValid) {
+  val minNumElements = lhs.getBase.countElements.min.getInt
Collaborator:

This leaks a ColumnVector and a Scalar. The result of countElements must be closed, and so must the result of min.
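
For reference, a minimal sketch of closing those intermediates, assuming the plugin's withResource helper (from its Arm trait) and the cudf Java API; the surrounding names mirror the diff, but the exact code is illustrative:

// Sketch only: countElements returns a ColumnVector and min returns a Scalar;
// both need to be closed once the int has been extracted.
withResource(lhs.getBase.countElements) { counts =>
  withResource(counts.min) { minScalar =>
    val minNumElements = minScalar.getInt
    // ... bounds check and extractListElement would go here ...
  }
}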

-  lhs.getBase.extractListElement(ordinal.getInt)
+if (ordinal.isValid) {
+  val minNumElements = lhs.getBase.countElements.min.getInt
+  if ((ordinal.getInt < 0 || minNumElements < ordinal.getInt + 1) && failOnError) {
Collaborator:

What is supposed to happen with a null array? countElements will return null for a null array, and min skips over nulls unless all of them are null. So is null[1] an error in ANSI mode or not? If it is, then this code will completely miss it. If it is not an error, then we will get an exception, or possibly data corruption, when we try to get the int value from the result of min when the batch is all null arrays.

@wjxiz1992 (Collaborator Author) commented May 12, 2021:

I did a simple test about this problem:

// row data is like:

+------------------------+
|col_1                   |
+------------------------+
|null                    |
|[Java, Scala, C++, a, b]|
+------------------------+

df.select(col("col_1")[2]).show()

+--------+
|col_1[2]|
+--------+
|    null|
|     C++|
+--------+

CPU and GPU return the same result for this case.
But as you said, the error occurs when the column contains all null arrays:
countElements works well, but min will return 0 in that case and the exception here gets thrown
(the CPU will still return null for them).

For the all_null case, I plan to compare getNullCount with getRowCount, but the docs say getNullCount is a very expensive op. Do you think we should apply this method here?
@revans2
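
For illustration, the proposed check might look like the following sketch (assuming cudf's Java API; whether getNullCount is too expensive here is exactly the open question):

// Sketch only: detect a batch whose rows are all null arrays, since
// min over countElements is not meaningful there.
// getNullCount may be expensive when the null count is not already cached.
val base = lhs.getBase
val allNullArrays = base.getNullCount == base.getRowCount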

Collaborator Author:

Updated.

@revans2 mentioned this pull request May 7, 2021
@revans2 (Collaborator) left a comment:

Looking better. My main concern right now is that I don't see how we are setting failOnError properly for Spark 3.1.1+. I think we need to either check the version number in the meta, which is brittle, or preferably put the rule into the shim layer so Spark 3.1.1 can override the behavior.

'spark.sql.legacy.allowNegativeScaleOfDecimal': True},
error_message='java.lang.ArrayIndexOutOfBoundsException')

@pytest.mark.skipif(not is_before_spark_311(), reason="This will throw exception only in Spark 3.1.1+")
Collaborator:

The reason looks like a copy and paste. On this test it is not clear what it means. It might be nice to update both to say something like "In Spark 3.1.1+, ANSI mode array index throws on out-of-range indexes".

Collaborator Author:

> Looking better. My main concern right now is that I don't see how we are setting failOnError properly for Spark 3.1.1+. I think we need to either check the version number in the meta, which is brittle, or preferably put the rule into the shim layer so Spark 3.1.1 can override the behavior.

Agreed. Currently I set failOnError to false in convertToGpu for all Spark versions before 3.1.1, because ANSI mode has no effect on them; that matches the behavior of Spark 3.1.1 with ANSI mode=false.

For Spark 3.1.1, I pass the real ANSI mode config in convertToGpu, so the behavior follows ANSI=true or false.
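
A simplified sketch of that split (the class shapes below are illustrative, not the plugin's real meta API):

import org.apache.spark.sql.internal.SQLConf

// Illustrative only: the base rule hard-codes failOnError = false because
// ANSI mode does not affect these expressions before Spark 3.1.1.
trait GetArrayItemRule {
  def failOnError: Boolean = false
}

// The Spark 3.1.1 shim overrides the rule to honor the session's ANSI setting.
object Spark311GetArrayItemRule extends GetArrayItemRule {
  override def failOnError: Boolean = SQLConf.get.ansiEnabled
}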

Collaborator:

Sorry, I missed that. Looks good then; all I have is this nit.

@wjxiz1992 changed the title from "AnsiMode support for GetArrayItem for Spark 3.1.1" to "AnsiMode support for GetArrayItem GetMapValue and ElementAt for Spark 3.1.1" May 14, 2021
@wjxiz1992 marked this pull request as ready for review May 17, 2021 04:01
Signed-off-by: Allen Xu <[email protected]>
@firestarman (Collaborator) left a comment:

LGTM, only some nits.

@wjxiz1992 (Collaborator Author):
build

@revans2 previously approved these changes May 17, 2021
@wjxiz1992 (Collaborator Author):
build

@wjxiz1992 merged commit d65476d into NVIDIA:branch-0.6 May 18, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
… 3.1.1 (NVIDIA#2350)

To match the behavior of GetArrayItem, GetMapValue and ElementAt with CPU in Spark 3.1.1.

Signed-off-by: Allen Xu <[email protected]>
Labels: task (Work required that improves the product but is not user facing)
Projects: None yet
5 participants