[SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor signature #22732
Conversation
extends Expression with ImplicitCastInputTypes with NonSQLExpression with UserDefinedExpression {

  // The constructor for SPARK 2.1 and 2.2
  def this(
      function: AnyRef,
      dataType: DataType,
      children: Seq[Expression],
      handleNullForInputs: Seq[Boolean],
We don't want to change this constructor, I believe. It's preserved for backwards-compatibility. It should pass an empty Seq or Nil for the new param.
Just realized this is different from your original PR already. @cloud-fan did a follow-up PR adding the "nullableTypes". I'll revert the change here.
I think we should just remove this constructor. It's weird to keep backward compatibility for a private class, and I don't think it can work anymore. It's not OK to omit the nullable info.
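To make the trade-off concrete, here is a minimal sketch (a hypothetical `MiniScalaUDF` with simplified types, not the actual Spark code) of what keeping a legacy constructor would require: filling the new mandatory field with a conservative default instead of omitting it.

```scala
// Hypothetical stand-in for ScalaUDF, with Seq[String] in place of
// Seq[Expression] to keep the sketch self-contained.
case class MiniScalaUDF(
    function: AnyRef,
    children: Seq[String],
    handleNullForInputs: Seq[Boolean]) {

  // Legacy-style constructor: the caller gives no nullability info, so
  // conservatively assume every input needs null handling.
  def this(function: AnyRef, children: Seq[String]) =
    this(function, children, children.map(_ => true))
}
```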
@@ -31,6 +31,7 @@ import org.apache.spark.sql.types.DataType
 * null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
 * @param dataType Return type of function.
 * @param children The input expressions of this UDF.
+* @param handleNullForInputs Whether the inputs need null-value handling respectively.
Maybe give a quick example to clarify? like "For example, primitive types can't take on the value null, and need special handling."
Yes, it needs a little explanation here regarding the primitive types. Since those types are not nullable, a null value usually ends up being represented as a zero value. I should add that to the java doc here.
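For illustration, a plain-Scala snippet (no Spark required) showing the zero-value behavior being described:

```scala
// A null forced into a primitive type unboxes to that type's default value.
val fromPrimitive = null.asInstanceOf[Int]     // unboxes to 0
val fromBoxed = null.asInstanceOf[Integer]     // boxed type, stays null
println(s"$fromPrimitive $fromBoxed")          // prints: 0 null
```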
val expected2 =
-  If(IsNull(double), nullResult, udf2.copy(children = string :: KnownNotNull(double) :: Nil))
+  If(IsNull(double), nullResult, udf2.copy(handleNullForInputs = false :: false :: Nil))
Just checking my understanding -- string gets false because it's a reference type, and a double known to not be null also doesn't need null checking, so gets false. OK that sounds right.
Yes, as explained in #22732 (comment)
}

Column(ScalaUDF(
  f,
  dataType,
  exprs.map(_.expr),
  handleNullForInputs.getOrElse(exprs.map(_ => false)),
If not specified, assume nothing needs special null handling. OK, check.
I am actually not so sure about this part, but this is just to be consistent with the behavior in your previous check-in. Can you give an example of such end-user cases where these flags are unavailable/not specified?
I don't know of a case. I suppose that's fine default behavior, and matches what happened in the previous change.
I actually assume the default behavior should be the other way around. If we don't know, we just do the if-else null handling, and it wouldn't do us any harm correctness-wise, right? Anyway, I'm not gonna change that in this PR, but I hope we can get an idea of if and when this default behavior will happen.
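For reference, the conservative default being argued for here could look like this sketch (hypothetical helper name):

```scala
// With no nullability information, assume every input may need the
// if-else null handling: extra checks, but no correctness risk.
def defaultNullHandling(specified: Option[Seq[Boolean]], numInputs: Int): Seq[Boolean] =
  specified.getOrElse(Seq.fill(numInputs)(true))
```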
-  ScalaUDF(func, dataType, e, inputSchemas.map(_.map(_.dataType)).getOrElse(Nil), Some(name), nullable,
-    udfDeterministic = true, nullableTypes = inputSchemas.map(_.map(_.nullable)).getOrElse(Nil))
+  ScalaUDF(func, dataType, e,
+    inputSchemas.map(_.map(!_.nullable)).getOrElse(e.map(_ => false)),
Passing !nullable instead of nullable, yep, looks right. The meaning is flipped here, correct.
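A tiny self-contained check of that flipped meaning (illustrative values only):

```scala
val nullable = Seq(true, false)            // e.g. a String input and a Double input
val handleNullForInputs = nullable.map(!_) // Seq(false, true): only the
                                           // non-nullable input needs handling
```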
    .getOrElse(udf)
}
case udf @ ScalaUDF(_, _, inputs, handleNullForInputs, _, _, _, _)
    if !handleNullForInputs.forall(!_) =>
how about `if handleNullForInputs.exists(_)`? If I understand the meaning right, that could be simpler: if anything is true you have to trigger the handling.
Good point.
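As a side note, the literal placeholder form `exists(_)` does not compile in Scala; an equivalent compiling version of the suggestion:

```scala
val flags = Seq(false, true, false)
// "any flag set" can be written either way; both equal !flags.forall(!_)
assert(flags.exists(identity) == !flags.forall(!_))
assert(flags.contains(true)) // arguably the clearest spelling
```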
// TODO: skip null handling for not-nullable primitive inputs after we can completely
// trust the `nullable` information.
val inputsNullCheck = handleNullForInputs.zip(inputs)
The TODO above seems to apply to the line that was removed? I'm not totally clear. Is it OK to not check `!expr.isInstanceOf[KnownNotNull]` now? You know this better than I do. Hey, if tests pass, I think that proves it.
This should answer/confirm a couple of other questions above:

Since we already have this flag `handleNullForInputs` in `ScalaUDF`, we can take advantage of it in this rule as well. Say the first time a `ScalaUDF` hits this rule with `handleNullForInputs` as "false, true, false", we add null handling (an if clause) for the second input, which is flagged with "true"; from that point on we are all good with all inputs, and we can flag the new `ScalaUDF`'s `handleNullForInputs` as all "false". So even if the new `ScalaUDF` hits this rule a second time, nothing will be done.

It should work the same way for the "TODO" above: if `handleNullForInputs` has a "true" flag and the corresponding expression is NOT nullable, we can skip the null handling while flagging it as "false" in the new `ScalaUDF` in the end.
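A heavily simplified sketch of that convergence argument, with plain strings standing in for Catalyst expressions (hypothetical helper, not the real rule):

```scala
// Guard the flagged inputs once, then clear every flag so that a second
// pass over the rewritten UDF is a no-op.
def handleNullInputs(flags: Seq[Boolean], inputs: Seq[String]): (Option[String], Seq[Boolean]) = {
  val checks = flags.zip(inputs).collect { case (true, in) => s"IsNull($in)" }
  val guard = checks.reduceOption(_ + " OR " + _) // None when nothing to guard
  (guard, flags.map(_ => false))                  // all inputs covered from now on
}

// First pass adds the guard for the second input; a second pass sees no flags set.
assert(handleNullInputs(Seq(false, true, false), Seq("a", "b", "c")) ==
  (Some("IsNull(b)"), Seq(false, false, false)))
```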
Test build #97409 has finished for PR 22732 at commit
 */
case class ScalaUDF(
    function: AnyRef,
    dataType: DataType,
    children: Seq[Expression],
    handleNullForInputs: Seq[Boolean],
Maybe I missed something, but:

- Why don't we just merge `handleNullForInputs` and `inputTypes`?
- Why is `handleNullForInputs` required whereas `inputTypes`'s default is `Nil`?
Adding `handleNullForInputs` doesn't look like it reduces confusion much to me. Since this PR mainly targets refactoring to reduce confusion and improve ease of use, this concern should be addressed.
It could be merged. I guess the preference is to minimize the change required, so we just have a new field.
That said, UDFs now really require nullability information to work correctly in Scala 2.12. This is the reason the new field is required and not optional. It already caused a new test to fail, so I'm more persuaded that it's OK to make a less backwards-compatible change to make it clear to any future callers that new info is needed. It's an internal class, so reasonable to do so.
Among all other reasons, one argument for not merging the flags with the types is #22732 (comment).
As to your second question, it would be for same reason as above, plus #22259 (comment).
@@ -31,6 +31,7 @@ import org.apache.spark.sql.types.DataType
 * null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
 * @param dataType Return type of function.
 * @param children The input expressions of this UDF.
+* @param handleNullForInputs Whether the inputs need null-value handling respectively.
how about `acceptNullInputs`?
Makes sense. Since I'm also using this flag to replace `KnownNotNull`, I think `inputsNullSafe` would be a better name?
@@ -179,7 +179,8 @@ class DynamicPartitionDataWriter(
 val partitionName = ScalaUDF(
   ExternalCatalogUtils.getPartitionPathString _,
   StringType,
-  Seq(Literal(c.name), Cast(c, StringType, Option(description.timeZoneId))))
+  Seq(Literal(c.name), Cast(c, StringType, Option(description.timeZoneId))),
+  Seq(false, false))
Suggested change:
-  Seq(false, false))
+  handleNullForInputs = Seq(false, false))
Don't think we need to do that for required arguments, or do we?
@@ -81,11 +81,11 @@ case class UserDefinedFunction protected[sql] (
 f,
 dataType,
 exprs.map(_.expr),
 nullableTypes.map(_.map(!_)).getOrElse(exprs.map(_ => false)),
We need to be more conservative here. The default behavior needs to handle NULLs. We should fix it:

// Need to handle NULL when the fields can't accept NULL.
val handleNullForInputs = nullableTypes.map(_.map(!_)).getOrElse(Seq.fill(exprs.size)(true))
@srowen @cloud-fan This sounds like a new behavior added in the Spark 2.4 release by PR #22259, right?
It's new in the sense that Scala 2.12 forced us to change where we get type information for UDFs, or specifically whether the type is allowed to be null. Before, we could infer it directly from the closure, but not in 2.12. This change is supposed to just get the info from elsewhere -- from TypeTags when the UDFs are declared. The change ought to be invisible to users and, if done correctly, should not result in a behavior change. This follow-up change is more about being conservative and lessening the chance of a mistake (see the test case maryannxue found) internally, or of a mistake for a library that uses ScalaUDF directly anyway.

For that reason, yeah, I agree with your comment here. Default to assuming null checks are needed.
Looks like the only place where we'd get a not-specified `inputSchemas` is when `ScalaReflection.schemaFor` doesn't recognize a type and throws an exception (spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala, line 786 in 1fd59c1):

throw new UnsupportedOperationException(s"Schema for type $other is not supported")

val inputSchemas = Try(ScalaReflection.schemaFor(typeTag[A1]) :: ScalaReflection.schemaFor(typeTag[A2]) :: Nil).toOption
val udf = SparkUserDefinedFunction.create(f, dataType, inputSchemas)

It would mean that if the type of only one of the parameters is unrecognizable by `ScalaReflection`, we'd end up having the entire `Seq` as `None`. I think it's fine not to null-check user-defined types that we don't know, since they can't be primitive types anyway, but I do think we should make the type inference of each parameter independent, so we do handle the nulls that need to be taken care of.
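The per-parameter independence suggested here could look roughly like this sketch (assuming `ScalaReflection.schemaFor` and `ScalaReflection.Schema` as used in the snippet above):

```scala
import scala.reflect.runtime.universe.{typeTag, TypeTag}
import scala.util.Try
import org.apache.spark.sql.catalyst.ScalaReflection

// Each parameter gets its own Try, so one unsupported type yields a single
// None instead of discarding the nullability info of every parameter.
def schemasFor[A1: TypeTag, A2: TypeTag]: Seq[Option[ScalaReflection.Schema]] =
  Seq(Try(ScalaReflection.schemaFor(typeTag[A1])).toOption,
      Try(ScalaReflection.schemaFor(typeTag[A2])).toOption)
```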
Yes, that's right. There are a number of UDFs in MLlib, etc. that have inputs of type "Any", which isn't great, but I wanted to work around rather than change them for now.
In addition to the case I just pointed out, where we did try to get `inputSchemas` through `ScalaReflection.schemaFor` and got an exception for unrecognized types, there's another case where we could get an unspecified `nullableTypes`: when `UserDefinedFunction` is instantiated by calling the constructor rather than the `create` method. Then I assume it's created by an earlier version, and we should use the old logic, i.e., `ScalaReflection.getParameterTypes` (https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2L2153), to get the correct information for `nullableTypes`. Is that right, @cloud-fan @srowen?
Hm, but we can't use getParameterTypes anymore. It won't work in Scala 2.12. Where the nullability info is definitely not available, be conservative and assume it all needs null handling?
If we can't get the type information at compile time (e.g. `(a: Any) => xxx`), I don't think `ScalaReflection.getParameterTypes` can get anything useful at runtime either. In this case we won't add a null check anyway, so I think there is no behavior change.
Test build #97464 has finished for PR 22732 at commit
@@ -31,6 +31,9 @@ import org.apache.spark.sql.types.DataType
 * null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
 * @param dataType Return type of function.
 * @param children The input expressions of this UDF.
+* @param handleNullForInputs Whether the inputs need null-value handling, which preserves the null
`inputsNullSafe` SGTM; let's also rename `UserDefinedFunction.nullableTypes`.
There is an argument about whether #22259 introduced behavior changes. Here is my analysis.

Before #22259, the type and null check was done as

After #22259, the type and null check is done as

So we may have a behavior change if users register UDFs in a weird way, e.g. they define

I'd say this is an invalid use case, because the data type check is also lost. I think we don't need to treat it as a behavior change.
After more thought, there is one case we don't handle well. For UDFs like
That's a good point, but that was already an issue, right? It isn't introduced by this change, at least?
@srowen What @cloud-fan described is a change introduced in #22259. We can fix it by keeping each call to
At this point I think you both know more than I do about this, so go ahead. To the limits of my understanding it sounds reasonable.
extends Expression with ImplicitCastInputTypes with NonSQLExpression with UserDefinedExpression {

  // The constructor for SPARK 2.1 and 2.2
  def this(
      function: AnyRef,
      dataType: DataType,
      children: Seq[Expression],
      inputsNullSafe: Seq[Boolean],
this constructor is here for backward compatibility. If we have to change it, I don't think we need to keep it anymore. cc @gatorsmile
We should not break it.
We'll take the Scala 2.11 approach in places where the nullableTypes info is unavailable, so as to at least keep legacy usage of ScalaUDF working.
I like the idea. We should not break our end users if they just use Scala 2.11. In Spark 3.0, we can be more aggressive and break the binary compatibility.
SGTM
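A version-gated sketch of the agreed approach (the `reflectNullability` parameter is a hypothetical stand-in for the Scala 2.11-only reflection path, not a stable Spark API):

```scala
import scala.util.Properties

// On Scala 2.11 the old reflection path can still recover parameter
// nullability for legacy callers; elsewhere, fall back to the
// conservative default of handling nulls for every input.
def legacyHandleNullForInputs(
    f: AnyRef,
    arity: Int,
    reflectNullability: AnyRef => Seq[Boolean]): Seq[Boolean] =
  if (Properties.versionNumberString.startsWith("2.11")) {
    reflectNullability(f).map(!_) // non-nullable (primitive) params need handling
  } else {
    Seq.fill(arity)(true)         // unknown: conservatively handle nulls everywhere
  }
```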
Test build #97508 has finished for PR 22732 at commit
@@ -73,19 +73,21 @@ case class UserDefinedFunction protected[sql] (
 */
@scala.annotation.varargs
def apply(exprs: Column*): Column = {
-  if (inputTypes.isDefined && nullableTypes.isDefined) {
-    require(inputTypes.get.length == nullableTypes.get.length)
+  val numOfArgs = ScalaReflection.getParameterCount(f)
since we are here, shall we also check `exprs.length == numOfArgs`?
Actually, I think the "assert" would better live in `ScalaUDF`, but then it would be slightly different from the current behavior anyway. I'm keeping other things as they are.
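The suggested fail-fast check, sketched (placement and message hypothetical):

```scala
// Compare the supplied columns against the function's parameter count
// before building the ScalaUDF.
def checkArity(numOfArgs: Int, exprs: Seq[Any]): Unit =
  require(exprs.length == numOfArgs,
    s"Invalid number of arguments: expected $numOfArgs but got ${exprs.length}")
```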
@@ -393,4 +393,30 @@ class UDFSuite extends QueryTest with SharedSQLContext {
   checkAnswer(df, Seq(Row("12"), Row("24"), Row("3null"), Row(null)))
 }
}

test("SPARK-25044 Verify null input handling for primitive types - with udf()") {
  val udf1 = udf({(x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0)})
nit: `udf((x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0))`
test("SPARK-25044 Verify null input handling for primitive types - with udf.register") {
  withTable("t") {
    sql("create table t(a varchar(10), b int, c varchar(10)) using parquet")
nit: upper case for keywords
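i.e., presumably:

```scala
sql("CREATE TABLE t(a VARCHAR(10), b INT, c VARCHAR(10)) USING parquet")
```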
docs/sql-programming-guide.md
@@ -1978,6 +1978,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string.
 - Since Spark 2.4, The LOAD DATA command supports wildcard `?` and `*`, which match any one character, and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Special Characters like `space` also now work in paths. Example: `LOAD DATA INPATH '/tmp/folder name/'`.
 - In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true` and returns 10 rows. This violates SQL standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` will return only one row. To restore the previous behavior, set `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` to `true`.
+- Since Spark 2.4, use of the method `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` should not expect any automatic null handling of the input parameters, thus a null input of a Scala primitive type will be converted to the type's corresponding default value in the UDF. All other UDF declaration and registration methods keep the same behavior as before.
Is this targeted for backport to 2.4?
yes, this fixes a bug introduced by a commit in 2.4
 * Returns the nullability of the input parameter types of the scala function object.
 *
 * Note that this only works with Scala 2.11, and the information returned may be inaccurate if
 * used with a different Scala version.
shall we explicitly return a seq of true if it's not scala 2.11? Then the behavior is more predictable than "may be inaccurate".
I think it simply returns if it goes through the code path below. I should probably make the java doc clearer.
The argument here is that it's not necessarily wrong when using Scala 2.12: if all inputs are of boxed types, then it can still be fine. I think it's enough to say "we don't support it; switch to the new interface, otherwise we can't guarantee correctness."
Test build #97578 has finished for PR 22732 at commit

Test build #97574 has finished for PR 22732 at commit

Test build #97573 has finished for PR 22732 at commit
For Scala 2.11, we should not introduce any behavior change and also keep binary and source compatibility.
For Scala 2.12, since we are unable to know the type nullability in a few APIs, we issue a warning message in these cases. Below is the example which will generate a wrong answer due to the primitive data types:
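For instance, a sketch along the lines of the PR description and the migration-guide note (column values illustrative; `spark` is assumed to be the usual shell session):

```scala
import org.apache.spark.sql.functions.{lit, udf}
import org.apache.spark.sql.types.LongType

// The untyped udf(f: AnyRef, dataType: DataType) API carries no
// nullability info for its inputs.
val f = udf((x: Long, y: Any) => x * 2, LongType)

// A null bound to the primitive Long parameter silently becomes 0L, so
// this yields 0 where null would be the expected answer.
spark.range(1).select(f(lit(null), lit("str"))).show()
```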
In Spark 3.0, we should have a follow-up PR to block these APIs. In Spark 2.4, the support of Scala 2.12 is beta.
Test build #97587 has finished for PR 22732 at commit

Test build #97581 has finished for PR 22732 at commit
retest this please
LGTM pending Jenkins
Test build #97601 has finished for PR 22732 at commit
thanks, merging to master/2.4!
## What changes were proposed in this pull request?

This is a follow-up PR for #22259. The extra field added in `ScalaUDF` by the original PR was declared optional, but should indeed be required; otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`.

#22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored to the original behavior. The only exception is documented in the Spark SQL migration guide.

In addition, now that we have this extra field indicating whether a null-test should be applied on the corresponding input value, we can also make use of this flag to keep the rule `HandleNullInputsForUDF` from being applied infinitely.

## How was this patch tested?

Added UT in UDFSuite. Passed affected existing UTs: AnalysisSuite, UDFSuite.

Closes #22732 from maryannxue/spark-25044-followup.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e816776)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

In apache#22732, we tried our best to keep the behavior of Scala UDFs unchanged in Spark 2.4. However, since Spark 3.0, Scala 2.12 is the default. The trick that was used to keep the behavior unchanged doesn't work with Scala 2.12. This PR proposes to remove the Scala 2.11 hack, as it's no longer useful.

## How was this patch tested?

Existing tests.

Closes apache#23498 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

This is a follow-up PR for #22259. The extra field added in `ScalaUDF` by the original PR was declared optional, but should indeed be required; otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`.

This PR also restores the original behavior for null-handling of primitive-type input parameters, which was broken by #22259. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259.

In addition, now that we have this extra field indicating whether a null-test should be applied on the corresponding input value, we can also make use of this flag to keep the rule `HandleNullInputsForUDF` from being applied infinitely.

How was this patch tested?

Added UT in UDFSuite.
Passed affected existing UTs:
AnalysisSuite
UDFSuite