Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound #30297

Closed
wants to merge 8 commits into from

Conversation

leanken-zz
Copy link
Contributor

@leanken-zz leanken-zz commented Nov 9, 2020

What changes were proposed in this pull request?

Instead of returning NULL, throws runtime ArrayIndexOutOfBoundsException when ansiMode is enable for element_atelt, GetArrayItem functions.

Why are the changes needed?

For ansiMode.

Does this PR introduce any user-facing change?

When spark.sql.ansi.enabled = true, Spark will throw ArrayIndexOutOfBoundsException if out-of-range index when accessing array elements

How was this patch tested?

Added UT and existing UT.

@github-actions github-actions bot added the SQL label Nov 9, 2020
@SparkQA
Copy link

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35390/

@SparkQA
Copy link

SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35390/

@SparkQA
Copy link

SparkQA commented Nov 9, 2020

Test build #130781 has finished for PR 30297 at commit d1e0ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ElementAt(
  • case class GetArrayItem(
  • case class Elt(

@dongjoon-hyun
Copy link
Member

cc @maropu

@maropu
Copy link
Member

maropu commented Nov 10, 2020

I think we need to update the ANSI doc accordingly if we change the behaviours: https://github.com/apache/spark/blame/master/docs/sql-ref-ansi-compliance.md#L110-L114 btw, is this change related to ANSI? The ANSI standard says something about these kinds of behaviours? @gengliangwang

@maropu
Copy link
Member

maropu commented Nov 10, 2020

Also, its better to update the doc of the ANSI config (e.g., 3. Spark will change the behavior of some SQL functions, e.g., size, ...), too.

val ANSI_ENABLED = buildConf("spark.sql.ansi.enabled")
.doc("When true, Spark tries to conform to the ANSI SQL specification: 1. Spark will " +
"throw a runtime exception if an overflow occurs in any operation on integral/decimal " +
"field. 2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " +
"the SQL parser.")

}
assert(ex.getMessage.contains("Invalid index: 4"))
} else {
checkAnswer(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you put this in a single line?

checkAnswer(df, Row(null))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for -1 or 0?

}
assert(ex.getMessage.contains("Invalid index: 5"))
} else {
checkAnswer(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add fine-grained tests for element_at and elt via checkExceptionInExpression?

@@ -1919,9 +1919,14 @@ case class ArrayPosition(left: Expression, right: Expression)
b
""",
since = "2.4.0")
case class ElementAt(left: Expression, right: Expression)
case class ElementAt(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the usage above of ExpressionDescription , too?

@@ -239,7 +240,11 @@ case class ConcatWs(children: Seq[Expression])
""",
since = "2.0.0")
// scalastyle:on line.size.limit
case class Elt(children: Seq[Expression]) extends Expression {
case class Elt(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seq(true, false).foreach { ansiEnabled =>
withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
val typeA = ArrayType(StringType)
val array = Literal.create(Seq("a", "b"), typeA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: val array = Literal.create(Seq("a", "b"), ArrayType(StringType))?

val ex = intercept[Exception] {
checkEvaluation(GetArrayItem(array, Literal(5)), null)
}
assert(stackTraceToString(ex).contains("Invalid index: 5"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for -1?


if (ansiEnabled) {
val ex = intercept[Exception] {
checkEvaluation(GetArrayItem(array, Literal(5)), null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use checkExceptionInExpression instead?

Change-Id: If7650cc45aa30fd3d6549536a8af6ca01a746c39
2. add more UT scenario.
3. update nullability calculation for ElementAt and update UT.

Change-Id: Ieec69fe5171dfee2863ad93f84bb660076214de0
@leanken-zz
Copy link
Contributor Author

@maropu updated with all your comments, plz have a look when you have time. ^_^
And also FYI, @cloud-fan

accesses elements from the last to the first. Returns NULL if the index exceeds the length
of the array.
accesses elements from the last to the first. If the index exceeds the length of the array,
Returns NULL if Ansi mode is off; Throws ArrayIndexOutOfBoundsException when Ansi mode is on.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ansi -> ANSI

@@ -1969,7 +1974,7 @@ case class ElementAt(left: Expression, right: Expression)
if (ordinal == 0) {
false
} else if (elements.length < math.abs(ordinal)) {
true
if (failOnError) false else true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: !failOnFalse

if ($index >= $eval1.numElements() || $index < 0$nullCheck) {
if ($index >= $eval1.numElements() || $index < 0) {
$failOnErrorBranch
} else if (false$nullCheck) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false$nullCheck -> $nullCheck

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we can get rid of the entire else if (false$nullCheck) if containsNull == false.

usage = "_FUNC_(n, input1, input2, ...) - Returns the `n`-th input, e.g., returns `input2` when `n` is 2.",
usage = """
_FUNC_(n, input1, input2, ...) - Returns the `n`-th input, e.g., returns `input2` when `n` is 2.
If the index exceeds the length of the array, Returns NULL if Ansi mode is off;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, Ansi -> ANSI

@@ -2146,7 +2146,9 @@ object SQLConf {
.doc("When true, Spark tries to conform to the ANSI SQL specification: 1. Spark will " +
"throw a runtime exception if an overflow occurs in any operation on integral/decimal " +
"field. 2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " +
"the SQL parser.")
"the SQL parser. 3. Spark will returns null for null input for function `size`. " +
"4. Spark will throw ArrayIndexOutOfBoundsException if invalid indices " +
Copy link
Contributor

@cloud-fan cloud-fan Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge 1 and 4: 1. Spark will throw an exception at runtime if the inputs to a SQL operator/function are invalid, e.g. overflow in arithmetic operations, out-of-range index when accessing array elements.


// CreateArray case invalid indices
assert(!ElementAt(array, Literal(0)).nullable)
if (ansiEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assert(ElementAt(array, Literal(4)).nullable == !ansiEnabled)


// GetArrayStructFields case invalid indices
assert(!ElementAt(stArray3, Literal(0)).nullable)
if (ansiEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -3621,6 +3621,40 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
df.select(map(map_entries($"m"), lit(1))),
Row(Map(Seq(Row(1, "a")) -> 1)))
}

test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not mix unit test and end-to-end test. Let's add UT in CollectionExpressionsSuite(for ElementAt) and StringExpressionsSuite (for Elt).

For end-to-end test, we can put them in DataFrameSuite for all the three functions/operators.

@SparkQA
Copy link

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35475/

@SparkQA
Copy link

SparkQA commented Nov 10, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35475/

@SparkQA
Copy link

SparkQA commented Nov 10, 2020

Test build #130868 has finished for PR 30297 at commit 7dcba56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2567,6 +2567,97 @@ class DataFrameSuite extends QueryTest
val df = l.join(r, $"col2" === $"col4", "LeftOuter")
checkAnswer(df, Row("2", "2"))
}

test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can we follow datetime.sql, which is run twice with and without ANSI mode? array.sql may be a good place to put the new end-to-end tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @cloud-fan 's suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move to array.sql, SPARK-33391 will be removed accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me too.

@gengliangwang
Copy link
Member

@leanken could you update the PR description. I think there is user-facing change when ANSI mode is enabled.

@@ -1883,4 +1887,32 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
Literal(stringToInterval("interval 1 year"))),
Seq(Date.valueOf("2018-01-01")))
}

test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, SPARK-33386 is better than SPARK-33391 because this test name has ArrayIndexOutOfBoundsException.

@@ -62,6 +62,29 @@ class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper {
checkEvaluation(GetArrayItem(nestedArray, Literal(0)), Seq(1))
}

test("SPARK-33391: GetArrayItem ArrayIndexOutOfBoundsException") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto SPARK-33386.

@@ -968,4 +968,34 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
GenerateUnsafeProjection.generate(
Sentences(Literal("\"quote"), Literal("\"quote"), Literal("\"quote")) :: Nil)
}

test("SPARK-33391: elt ArrayIndexOutOfBoundsException") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto SPARK-33386.

@@ -2042,10 +2053,17 @@ case class ElementAt(left: Expression, right: Expression)
} else {
""
}

val failOnErrorBranch = if (failOnError) {
s"""throw new ArrayIndexOutOfBoundsException("Invalid index: " + $index);""".stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we remove .stripMargin because this is a single line?

} else {
""
}

val failOnErrorBranch = if (failOnError) {
s"""throw new ArrayIndexOutOfBoundsException("Invalid index: " + $index);""".stripMargin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we remove .stripMargin because this is a single line?

accesses elements from the last to the first. The function returns NULL
if the index exceeds the length of the array and `spark.sql.ansi.enabled` is set to false.
If `spark.sql.ansi.enabled` is set to true, it throws ArrayIndexOutOfBoundsException
for invalid indices.

_FUNC_(map, key) - Returns value for given key, or NULL if the key is not contained in the map
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make ElementAtbehavior consistent on map type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's about to support ANSI mode for map type in next PR.

@@ -2008,7 +2015,11 @@ case class ElementAt(left: Expression, right: Expression)
val array = value.asInstanceOf[ArrayData]
val index = ordinal.asInstanceOf[Int]
if (array.numElements() < math.abs(index)) {
null
if (failOnError) {
throw new ArrayIndexOutOfBoundsException(s"Invalid index: $index")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we include the total number of elements too in the error message? sometimes that is helpful for debugging.

"throw an exception at runtime if the inputs to a SQL operator/function are invalid, " +
"e.g. overflow in arithmetic operations, out-of-range index when accessing array elements. " +
"2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " +
"the SQL parser. 3. Spark will returns null for null input for function `size`.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will returns -> will return, and perhaps capital NULL?

2. move DataFrameSuite code into array.sql and ansi/array.sql
3. add numElements to exception message.
4. other code refine.

Change-Id: Ieb322ed7b036fc3322fd3b814c8508bfef266378
@leanken-zz
Copy link
Contributor Author

@sunchao @dongjoon-hyun @cloud-fan updated, resolved your comments.

1. fix GetArrayItem nullablity issue.
2. move DataFrameSuite code into array.sql and ansi/array.sql
3. add numElements to exception message.
4. other code refine.

@leanken-zz
Copy link
Contributor Author

@leanken could you update the PR description. I think there is user-facing change when ANSI mode is enabled.

updated.

@SparkQA
Copy link

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35568/

} else {
""
}

val failOnErrorBranch = if (failOnError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it should be indexOutOfBoundBranch

@@ -2042,10 +2039,20 @@ case class ElementAt(left: Expression, right: Expression)
} else {
""
}

val failOnErrorBranch = if (failOnError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it should be indexOutOfBoundBranch

@@ -323,6 +338,17 @@ case class Elt(children: Seq[Expression]) extends Expression {
""".stripMargin
}.mkString)

val failOnErrorBranch = if (failOnError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it should be indexOutOfBoundBranch

@SparkQA
Copy link

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35568/

Change-Id: I719078352ab04f3ab096e62fd2b6a1c06bbdffd3
@SparkQA
Copy link

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35574/

@SparkQA
Copy link

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35574/

@SparkQA
Copy link

SparkQA commented Nov 12, 2020

Test build #130962 has finished for PR 30297 at commit 36d235a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2020

Test build #130968 has finished for PR 30297 at commit 4b161a4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

GA passed, merging to master, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants