
[SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12 #22259

Closed
wants to merge 1 commit

Conversation

srowen
Member

@srowen srowen commented Aug 28, 2018

What changes were proposed in this pull request?

Alternative take on #22063 that does not introduce udfInternal.
Resolves the issue of inferring function argument types in Scala 2.12 by instead using information captured when the UDF is registered, namely which argument types are nullable (i.e. not primitive).
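The registration-time capture described above can be sketched in plain Scala. This is an illustrative sketch, not Spark's actual API: `isNullable` and `nullableTypesOf` are hypothetical helper names.

```scala
import scala.reflect.ClassTag

// A type is "nullable" for this purpose iff it is not a JVM primitive.
// At registration time the Scala types are still statically known, so
// nullability can be recorded per parameter instead of being inferred
// later from the type-erased closure in 2.12.
def isNullable[T](implicit ct: ClassTag[T]): Boolean =
  !ct.runtimeClass.isPrimitive

// Hypothetical registration helper for a two-argument UDF.
def nullableTypesOf[A1: ClassTag, A2: ClassTag]: Seq[Boolean] =
  Seq(isNullable[A1], isNullable[A2])
```

For example, `nullableTypesOf[Int, String]` would yield `Seq(false, true)`: the `Int` argument is primitive and needs a null guard, while the `String` argument can carry null itself.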

How was this patch tested?

Existing tests.

@srowen
Member Author

srowen commented Aug 28, 2018

@cloud-fan you might be interested in this. This is a simpler version that doesn't use udfInternal. It is really just about the nullability issue. It leaves the current behavior, in which schema inference failures are quietly ignored.

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95387 has finished for PR 22259 at commit d99cfa6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Aug 29, 2018

LGTM. How did you work around the type tag not found issue? nvm, I saw your comments in the previous PR.

@SparkQA

SparkQA commented Aug 29, 2018

Test build #4297 has finished for PR 22259 at commit d99cfa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

inputsNullCheck
.map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
.getOrElse(udf)
case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
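The rule in the snippet above wraps the UDF call in an `If` over a combined null check of the primitive-typed inputs. A minimal plain-Scala sketch of that logic, outside Catalyst (`invokeWithNullChecks` is a hypothetical helper, not Spark's API):

```scala
// If any input that maps to a primitive parameter (nullableTypes(i) == false)
// is null at runtime, short-circuit to null instead of invoking the function,
// mirroring If(inputsNullCheck, Literal(null, ...), udf) in the analyzer rule.
def invokeWithNullChecks(
    func: Seq[Any] => Any,
    nullableTypes: Seq[Boolean],
    inputs: Seq[Any]): Any = {
  val anyPrimitiveInputIsNull = nullableTypes.zip(inputs).exists {
    case (nullable, value) => !nullable && value == null
  }
  if (anyPrimitiveInputIsNull) null else func(inputs)
}
```

With `nullableTypes = Seq(false, true)`, a null in the first slot yields null without ever calling the function, while a null in the second slot is passed through to it.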
Contributor

nit: can we restore the spaces as in the original?

Contributor

ah missed it. We can clean it up later.

Member Author

I might be missing it: what is the space issue? There's an additional level of indent because of the if statement.

Contributor

It is in `udf@ScalaUDF`, which should have been `udf @ ScalaUDF`.

Member Author

Oh right. Yeah I didn't mean to change that. It's minor enough to leave I think. (or else standardize across the code)

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 1fd59c1 Aug 29, 2018
@srowen srowen changed the title [WIP][SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12 [SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12 Sep 3, 2018
@@ -47,7 +48,8 @@ case class ScalaUDF(
inputTypes: Seq[DataType] = Nil,
udfName: Option[String] = None,
nullable: Boolean = true,
udfDeterministic: Boolean = true)
udfDeterministic: Boolean = true,
nullableTypes: Seq[Boolean] = Nil)
Member

Using Nil as the default value is dangerous. We even do not have an assert to ensure it is set. We could easily miss the setting without the right values.

Contributor

We put the assert in the rule HandleNullInputsForUDF.

can we merge inputTypes and nullableTypes here so that we don't need to worry about it any more? cc @srowen

Member Author

The logic here was again that we wanted to avoid changing the binary signature. I know catalyst is effectively private to Spark, but this wasn't marked specifically private; I wondered if it would actually affect callers? If not we can go back and merge it.

Nil is just an empty list; I don't think it's dangerous and it is used above in inputTypes. It is not always set, because it's not always possible to infer the schema, let alone nullability.
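The concern about the `Nil` default can be illustrated with a toy case class (names here are illustrative, not Catalyst's):

```scala
// With a defaulted parameter, a call site that forgets to pass
// nullableTypes still compiles, and the null checks for primitive
// inputs are then silently skipped downstream.
case class MiniScalaUDF(
    inputTypes: Seq[String] = Nil,
    nullableTypes: Seq[Boolean] = Nil)

// Compiles fine, but nullableTypes is silently empty:
val udf = MiniScalaUDF(inputTypes = Seq("int", "string"))
```

This is the trade-off under discussion: a default keeps the constructor signature source-compatible, at the cost of making the new field easy to omit by accident.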

Contributor

I think the problem is more about the way we handle nullableTypes if not specified as in https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2157. The test failure of https://github.com/apache/spark/pull/21851/files#diff-e8dddba2915a147970654aa93bee30a7R344 would have been exposed if the nullableTypes had been updated in this PR. So I would say logically this parameter is required, but right now it is declared optional. In this case, things went wrong when nullableTypes was left unspecified, and this could happen not only with tests but in "source" too. I suggest we move this parameter up right after inputTypes so it can get the attention it needs.

Member Author

I see: you are saying that some UDF needed to declare nullable types but didn't? I made the param optional to try to make migration easier and avoid changing the signature much. But doesn't the test you point to pass? Are you saying it should not pass?

Member Author

Yeah I could see an argument that it need not block release. The functionality works as intended, at least.

Would you change it again in 2.4.1? If not, then we have effectively decided to just keep this behavior. Let's at least get this in if there is a new RC.


Contributor

I think it's mostly about maintainability. We should definitely update the code as @maryannxue said, so that ScalaUDF is easier to use and not that error-prone. I feel we don't need to backport it, as it's basically code refactor.

Member

@maryannxue Please submit a PR to make the parameter nullableTypes required.

Member Author

Yeah looks like we should just make these changes after all, and for 2.4, as we need another RC.

asfgit pushed a commit that referenced this pull request Sep 5, 2018
## What changes were proposed in this pull request?

This is a followup of #22259 .

Scala case class has a wide surface: apply, unapply, accessors, copy, etc.

In #22259 , we change the type of `UserDefinedFunction.inputTypes` from `Option[Seq[DataType]]` to `Option[Seq[Schema]]`. This breaks backward compatibility.

This PR changes the type back, and use a `var` to keep the new nullable info.

## How was this patch tested?

N/A

Closes #22319 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@srowen srowen deleted the SPARK-25044.2 branch September 6, 2018 17:50
asfgit pushed a commit that referenced this pull request Oct 19, 2018
## What changes were proposed in this pull request?

This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional, but should be indeed required, otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`.

#22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored with the original behavior. The only exception is documented in the Spark SQL migration guide.

In addition, now that we have this extra field indicating if a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied infinitely.
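The `0`-instead-of-`null` behavior comes from JVM unboxing: when a lambda with a primitive parameter is invoked through the generic `FunctionN` interface, a null box unboxes to the primitive's default value. A minimal sketch, with no Spark involved:

```scala
// (Int, Any) => Int compiles to a Function2 whose generic apply method
// unboxes its first argument; unboxing a null java.lang.Integer yields 0.
val f: (Int, Any) => Int = (x, _) => x

// Simulate calling through the erased interface, as generated code does:
val g = f.asInstanceOf[(Any, Any) => Any]

// Without an explicit null check, the null silently becomes 0, not null:
val result = g(null, "str")  // 0
```

This is why the null check must be generated around the call rather than relying on the function itself to observe the null.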

## How was this patch tested?
Added UT in UDFSuite

Passed affected existing UTs:
AnalysisSuite
UDFSuite

Closes #22732 from maryannxue/spark-25044-followup.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e816776)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 19, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019