-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound #30297
Conversation
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #130781 has finished for PR 30297 at commit
|
cc @maropu |
I think we need to update the ANSI doc accordingly if we change the behaviours: https://github.com/apache/spark/blame/master/docs/sql-ref-ansi-compliance.md#L110-L114 btw, is this change related to ANSI? The ANSI standard says something about these kinds of behaviours? @gengliangwang |
Also, its better to update the doc of the ANSI config (e.g., spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Lines 2145 to 2149 in 90f6f39
|
} | ||
assert(ex.getMessage.contains("Invalid index: 4")) | ||
} else { | ||
checkAnswer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you put this in a single line?
checkAnswer(df, Row(null))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests for -1
or 0
?
} | ||
assert(ex.getMessage.contains("Invalid index: 5")) | ||
} else { | ||
checkAnswer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add fine-grained tests for element_at
and elt
via checkExceptionInExpression
?
@@ -1919,9 +1919,14 @@ case class ArrayPosition(left: Expression, right: Expression) | |||
b | |||
""", | |||
since = "2.4.0") | |||
case class ElementAt(left: Expression, right: Expression) | |||
case class ElementAt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the usage above of ExpressionDescription
, too?
@@ -239,7 +240,11 @@ case class ConcatWs(children: Seq[Expression]) | |||
""", | |||
since = "2.0.0") | |||
// scalastyle:on line.size.limit | |||
case class Elt(children: Seq[Expression]) extends Expression { | |||
case class Elt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq(true, false).foreach { ansiEnabled => | ||
withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) { | ||
val typeA = ArrayType(StringType) | ||
val array = Literal.create(Seq("a", "b"), typeA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: val array = Literal.create(Seq("a", "b"), ArrayType(StringType))
?
val ex = intercept[Exception] { | ||
checkEvaluation(GetArrayItem(array, Literal(5)), null) | ||
} | ||
assert(stackTraceToString(ex).contains("Invalid index: 5")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests for -1
?
|
||
if (ansiEnabled) { | ||
val ex = intercept[Exception] { | ||
checkEvaluation(GetArrayItem(array, Literal(5)), null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use checkExceptionInExpression
instead?
Change-Id: If7650cc45aa30fd3d6549536a8af6ca01a746c39
2. add more UT scenario. 3. update nullability calculation for ElementAt and update UT. Change-Id: Ieec69fe5171dfee2863ad93f84bb660076214de0
d1e0ae3
to
7dcba56
Compare
@maropu updated with all your comments, plz have a look when you have time. ^_^ |
accesses elements from the last to the first. Returns NULL if the index exceeds the length | ||
of the array. | ||
accesses elements from the last to the first. If the index exceeds the length of the array, | ||
Returns NULL if Ansi mode is off; Throws ArrayIndexOutOfBoundsException when Ansi mode is on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ansi
-> ANSI
@@ -1969,7 +1974,7 @@ case class ElementAt(left: Expression, right: Expression) | |||
if (ordinal == 0) { | |||
false | |||
} else if (elements.length < math.abs(ordinal)) { | |||
true | |||
if (failOnError) false else true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: !failOnFalse
if ($index >= $eval1.numElements() || $index < 0$nullCheck) { | ||
if ($index >= $eval1.numElements() || $index < 0) { | ||
$failOnErrorBranch | ||
} else if (false$nullCheck) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
false$nullCheck
-> $nullCheck
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can get rid of the entire else if (false$nullCheck)
if containsNull == false
.
usage = "_FUNC_(n, input1, input2, ...) - Returns the `n`-th input, e.g., returns `input2` when `n` is 2.", | ||
usage = """ | ||
_FUNC_(n, input1, input2, ...) - Returns the `n`-th input, e.g., returns `input2` when `n` is 2. | ||
If the index exceeds the length of the array, Returns NULL if Ansi mode is off; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, Ansi -> ANSI
@@ -2146,7 +2146,9 @@ object SQLConf { | |||
.doc("When true, Spark tries to conform to the ANSI SQL specification: 1. Spark will " + | |||
"throw a runtime exception if an overflow occurs in any operation on integral/decimal " + | |||
"field. 2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " + | |||
"the SQL parser.") | |||
"the SQL parser. 3. Spark will returns null for null input for function `size`. " + | |||
"4. Spark will throw ArrayIndexOutOfBoundsException if invalid indices " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can merge 1 and 4: 1. Spark will throw an exception at runtime if the inputs to a SQL operator/function are invalid, e.g. overflow in arithmetic operations, out-of-range index when accessing array elements.
|
||
// CreateArray case invalid indices | ||
assert(!ElementAt(array, Literal(0)).nullable) | ||
if (ansiEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: assert(ElementAt(array, Literal(4)).nullable == !ansiEnabled)
|
||
// GetArrayStructFields case invalid indices | ||
assert(!ElementAt(stArray3, Literal(0)).nullable) | ||
if (ansiEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
@@ -3621,6 +3621,40 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession { | |||
df.select(map(map_entries($"m"), lit(1))), | |||
Row(Map(Seq(Row(1, "a")) -> 1))) | |||
} | |||
|
|||
test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not mix unit test and end-to-end test. Let's add UT in CollectionExpressionsSuite
(for ElementAt) and StringExpressionsSuite
(for Elt).
For end-to-end test, we can put them in DataFrameSuite
for all the three functions/operators.
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #130868 has finished for PR 30297 at commit
|
@@ -2567,6 +2567,97 @@ class DataFrameSuite extends QueryTest | |||
val df = l.join(r, $"col2" === $"col4", "LeftOuter") | |||
checkAnswer(df, Row("2", "2")) | |||
} | |||
|
|||
test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can we follow datetime.sql
, which is run twice with and without ANSI mode? array.sql
may be a good place to put the new end-to-end tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @cloud-fan 's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move to array.sql
, SPARK-33391
will be removed accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me too.
@leanken could you update the PR description. I think there is user-facing change when ANSI mode is enabled. |
@@ -1883,4 +1887,32 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper | |||
Literal(stringToInterval("interval 1 year"))), | |||
Seq(Date.valueOf("2018-01-01"))) | |||
} | |||
|
|||
test("SPARK-33391: element_at ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, SPARK-33386
is better than SPARK-33391
because this test name has ArrayIndexOutOfBoundsException
.
@@ -62,6 +62,29 @@ class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
checkEvaluation(GetArrayItem(nestedArray, Literal(0)), Seq(1)) | |||
} | |||
|
|||
test("SPARK-33391: GetArrayItem ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto SPARK-33386
.
@@ -968,4 +968,34 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
GenerateUnsafeProjection.generate( | |||
Sentences(Literal("\"quote"), Literal("\"quote"), Literal("\"quote")) :: Nil) | |||
} | |||
|
|||
test("SPARK-33391: elt ArrayIndexOutOfBoundsException") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto SPARK-33386
.
@@ -2042,10 +2053,17 @@ case class ElementAt(left: Expression, right: Expression) | |||
} else { | |||
"" | |||
} | |||
|
|||
val failOnErrorBranch = if (failOnError) { | |||
s"""throw new ArrayIndexOutOfBoundsException("Invalid index: " + $index);""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove .stripMargin
because this is a single line?
} else { | ||
"" | ||
} | ||
|
||
val failOnErrorBranch = if (failOnError) { | ||
s"""throw new ArrayIndexOutOfBoundsException("Invalid index: " + $index);""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove .stripMargin
because this is a single line?
accesses elements from the last to the first. The function returns NULL | ||
if the index exceeds the length of the array and `spark.sql.ansi.enabled` is set to false. | ||
If `spark.sql.ansi.enabled` is set to true, it throws ArrayIndexOutOfBoundsException | ||
for invalid indices. | ||
|
||
_FUNC_(map, key) - Returns value for given key, or NULL if the key is not contained in the map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make ElementAt
behavior consistent on map type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's about to support ANSI mode for map type in next PR.
@@ -2008,7 +2015,11 @@ case class ElementAt(left: Expression, right: Expression) | |||
val array = value.asInstanceOf[ArrayData] | |||
val index = ordinal.asInstanceOf[Int] | |||
if (array.numElements() < math.abs(index)) { | |||
null | |||
if (failOnError) { | |||
throw new ArrayIndexOutOfBoundsException(s"Invalid index: $index") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we include the total number of elements too in the error message? sometimes that is helpful for debugging.
"throw an exception at runtime if the inputs to a SQL operator/function are invalid, " + | ||
"e.g. overflow in arithmetic operations, out-of-range index when accessing array elements. " + | ||
"2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in " + | ||
"the SQL parser. 3. Spark will returns null for null input for function `size`.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will returns -> will return, and perhaps capital NULL
?
2. move DataFrameSuite code into array.sql and ansi/array.sql 3. add numElements to exception message. 4. other code refine. Change-Id: Ieb322ed7b036fc3322fd3b814c8508bfef266378
@sunchao @dongjoon-hyun @cloud-fan updated, resolved your comments.
|
updated. |
Kubernetes integration test starting |
} else { | ||
"" | ||
} | ||
|
||
val failOnErrorBranch = if (failOnError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it should be indexOutOfBoundBranch
@@ -2042,10 +2039,20 @@ case class ElementAt(left: Expression, right: Expression) | |||
} else { | |||
"" | |||
} | |||
|
|||
val failOnErrorBranch = if (failOnError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it should be indexOutOfBoundBranch
@@ -323,6 +338,17 @@ case class Elt(children: Seq[Expression]) extends Expression { | |||
""".stripMargin | |||
}.mkString) | |||
|
|||
val failOnErrorBranch = if (failOnError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it should be indexOutOfBoundBranch
Kubernetes integration test status success |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #130962 has finished for PR 30297 at commit
|
Test build #130968 has finished for PR 30297 at commit
|
GA passed, merging to master, thanks! |
What changes were proposed in this pull request?
Instead of returning NULL, throws runtime ArrayIndexOutOfBoundsException when ansiMode is enable for
element_at
,elt
,GetArrayItem
functions.Why are the changes needed?
For ansiMode.
Does this PR introduce any user-facing change?
When
spark.sql.ansi.enabled
= true, Spark will throwArrayIndexOutOfBoundsException
if out-of-range index when accessing array elementsHow was this patch tested?
Added UT and existing UT.