Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register "passthrough" UDFs with correct ordinal return type #541

Merged

Conversation

KevinGe00
Copy link
Contributor

@KevinGe00 KevinGe00 commented Oct 9, 2024

What changes are proposed in this pull request, and why are they necessary?

"Passthrough" UDFs are functions that take in a column as input, perform certain edits on the input fields (such as setting a field to NULL), but ultimately returns a column of the same schema and nullabilities as the input column. Prior to this PR, Coral was not reserving the input column nullabillities of these (new and spark native) passthrough UDFs when inferring the schema. This is was because these UDFs were registered with the return type ARG1 instead of OrdinalReturnTypeInferenceV2(1). While these two types are semantically the same, both representing a type inference strategy whereby the result type of a call is the type of the operand 1 (0-based), coral-schema is only set up to handle
OrdinalReturnTypeInferenceV2 and not ARG1, see here.

How was this patch tested?

  • New unit test, thanks @rzhang10
  • clean build
  • regression tests pending
  • integration tests in spark-shell

@KevinGe00 KevinGe00 changed the title Correct "passthrough" UDFs registered with the wrong ordinal return type Register "passthrough" UDFs with correct ordinal return type Oct 9, 2024
@rzhang10
Copy link
Member

rzhang10 commented Oct 9, 2024

@KevinGe00 thanks for the PR.

I think technically this looks a hacky solution to me, because the OrdinalReturnTypeInferenceV2 is a re-definition of the OrdinalReturnTypeInference without any real differences in semantics, it only just creates a new type for the sake of type distinction (This is a very typical code smell). So, this works only for functions that have OrdinalReturnTypeInferenceV2 typed inferences, but not OrdinalReturnTypeInference typed ones, which is weird. Also, I imagine there are functions that are more complex, and their return type inference has more complex rules for nullability, that can't be just described usingOrdinalReturnTypeInferencealone. Imagine a semantics such as:

func(arg1, arg2), and the return type is nullable only when both arg1 and arg2 are nullable.

My high-level point is the previous usage of ARG1 in StaticHiveFunctionRegistry is not wrong, it's the coral-schema doesn't have a systematic way to handle UDF type inferences regarding nullablility in general.

I can approve this for now as a workaround solution, but I think we do need to realize coral-schema lacks a systematic way to handle nullability in its current implementation.

@KevinGe00
Copy link
Contributor Author

KevinGe00 commented Oct 10, 2024

OrdinalReturnTypeInferenceV2 is a re-definition of the OrdinalReturnTypeInference without any real differences in semantics, it only just creates a new type for the sake of type distinction (This is a very typical code smell)

The reason OrdinalReturnTypeInferenceV2 was created isn't only for sake of type distinction, we needed it since OrdinalReturnTypeInference cannot be directly used because the private field 'ordinal' has no public accessor method --- the ordinal is needed for coral to return the correct return type. This was discussed in this thread. Furthermore, the practice of breaking Calcite's encapsulation is used (and accepted) throughout Coral, CoralJavaTypeFactoryImpl is another example.

So, this works only for functions that have OrdinalReturnTypeInferenceV2 typed inferences, but not OrdinalReturnTypeInference typed ones, which is weird.

I agree Coral deriving nullabilities correctly for OrdinalReturnTypeInferenceV2 and not for OrdinalReturnTypeInference is misleading given they are semantically the same return type. This arises from interchangeably using OrdinalReturnTypeInference (represented as ARG1) and OrdinalReturnTypeInference when the conditional that is supposed to handle these do not catch both cases. What we can do is implicitly enforce OrdinalReturnTypeInferenceV2 as the official ordinal return type strategy in Coral by renaming to CoralOrdinalReturnTypeInference, this also aligns with the Coral IR refactoring work as we move away from StaticHiveFunctionRegistry to something like CoralFunctionRegistry. Note that remaining usages of ARG1 still exists even after this PR, this is to maintain backwards compatible for those UDFs that still use it.

Also, I imagine there are functions that are more complex, and their return type inference has more complex rules for nullability, that can't be just described using OrdinalReturnTypeInference alone.

In cases where the return type inference has more complex rules, we have written classes that conditionally determine the return type given the input. See COALESCE_STRUCT_FUNCTION_RETURN_STRATEGY for an example.

coral-schema lacks a systematic way to handle nullability in its current implementation

coral-schema does have the ability to infer nullabillities, the current state is that all the inner fields created by operator are set as nullable in alignment with Spark's expectations, #369 . However, as this comment suggests, Coral needs a way to determine the engine the UDF was implemented in (i.e Hive UDF or Spark UDF) after the introduction of Spark native UDF support in Coral --- this may necessitate a wider discussion.

@rzhang10
Copy link
Member

@KevinGe00 I see you mentioned #369 , but Spark expects all the inner fields created by operator to be nullable.

However, doesn't the case presented by this PR serve a counterexample to that statement?

@KevinGe00 KevinGe00 merged commit c51d3b5 into linkedin:master Oct 17, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants