Register "passthrough" UDFs with correct ordinal return type #541

KevinGe00 · 2024-10-09T19:48:44Z

What changes are proposed in this pull request, and why are they necessary?

"Passthrough" UDFs are functions that take in a column as input, perform certain edits on the input fields (such as setting a field to NULL), but ultimately returns a column of the same schema and nullabilities as the input column. Prior to this PR, Coral was not reserving the input column nullabillities of these (new and spark native) passthrough UDFs when inferring the schema. This is was because these UDFs were registered with the return type ARG1 instead of OrdinalReturnTypeInferenceV2(1). While these two types are semantically the same, both representing a type inference strategy whereby the result type of a call is the type of the operand 1 (0-based), coral-schema is only set up to handle
OrdinalReturnTypeInferenceV2 and not ARG1, see here.

How was this patch tested?

New unit test, thanks @rzhang10
clean build
regression tests pending
integration tests in spark-shell

rzhang10 · 2024-10-09T23:17:23Z

@KevinGe00 thanks for the PR.

I think technically this looks a hacky solution to me, because the OrdinalReturnTypeInferenceV2 is a re-definition of the OrdinalReturnTypeInference without any real differences in semantics, it only just creates a new type for the sake of type distinction (This is a very typical code smell). So, this works only for functions that have OrdinalReturnTypeInferenceV2 typed inferences, but not OrdinalReturnTypeInference typed ones, which is weird. Also, I imagine there are functions that are more complex, and their return type inference has more complex rules for nullability, that can't be just described usingOrdinalReturnTypeInferencealone. Imagine a semantics such as:

func(arg1, arg2), and the return type is nullable only when both arg1 and arg2 are nullable.

My high-level point is the previous usage of ARG1 in StaticHiveFunctionRegistry is not wrong, it's the coral-schema doesn't have a systematic way to handle UDF type inferences regarding nullablility in general.

I can approve this for now as a workaround solution, but I think we do need to realize coral-schema lacks a systematic way to handle nullability in its current implementation.

KevinGe00 · 2024-10-10T21:47:05Z

OrdinalReturnTypeInferenceV2 is a re-definition of the OrdinalReturnTypeInference without any real differences in semantics, it only just creates a new type for the sake of type distinction (This is a very typical code smell)

The reason OrdinalReturnTypeInferenceV2 was created isn't only for sake of type distinction, we needed it since OrdinalReturnTypeInference cannot be directly used because the private field 'ordinal' has no public accessor method --- the ordinal is needed for coral to return the correct return type. This was discussed in this thread. Furthermore, the practice of breaking Calcite's encapsulation is used (and accepted) throughout Coral, CoralJavaTypeFactoryImpl is another example.

So, this works only for functions that have OrdinalReturnTypeInferenceV2 typed inferences, but not OrdinalReturnTypeInference typed ones, which is weird.

I agree Coral deriving nullabilities correctly for OrdinalReturnTypeInferenceV2 and not for OrdinalReturnTypeInference is misleading given they are semantically the same return type. This arises from interchangeably using OrdinalReturnTypeInference (represented as ARG1) and OrdinalReturnTypeInference when the conditional that is supposed to handle these do not catch both cases. What we can do is implicitly enforce OrdinalReturnTypeInferenceV2 as the official ordinal return type strategy in Coral by renaming to CoralOrdinalReturnTypeInference, this also aligns with the Coral IR refactoring work as we move away from StaticHiveFunctionRegistry to something like CoralFunctionRegistry. Note that remaining usages of ARG1 still exists even after this PR, this is to maintain backwards compatible for those UDFs that still use it.

Also, I imagine there are functions that are more complex, and their return type inference has more complex rules for nullability, that can't be just described using OrdinalReturnTypeInference alone.

In cases where the return type inference has more complex rules, we have written classes that conditionally determine the return type given the input. See COALESCE_STRUCT_FUNCTION_RETURN_STRATEGY for an example.

coral-schema lacks a systematic way to handle nullability in its current implementation

coral-schema does have the ability to infer nullabillities, the current state is that all the inner fields created by operator are set as nullable in alignment with Spark's expectations, #369 . However, as this comment suggests, Coral needs a way to determine the engine the UDF was implemented in (i.e Hive UDF or Spark UDF) after the introduction of Spark native UDF support in Coral --- this may necessitate a wider discussion.

rzhang10 · 2024-10-11T18:21:40Z

@KevinGe00 I see you mentioned #369 , but Spark expects all the inner fields created by operator to be nullable.

However, doesn't the case presented by this PR serve a counterexample to that statement?

KevinGe00 and others added 3 commits October 2, 2024 19:59

register spark groot udf with correct return type

f709830

test showcase

7e9cd2c

extend + refactor unit test

2cb43f6

KevinGe00 changed the title ~~Correct "passthrough" UDFs registered with the wrong ordinal return type~~ Register "passthrough" UDFs with correct ordinal return type Oct 9, 2024

rzhang10 approved these changes Oct 16, 2024

View reviewed changes

KevinGe00 merged commit c51d3b5 into linkedin:master Oct 17, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Register "passthrough" UDFs with correct ordinal return type #541

Register "passthrough" UDFs with correct ordinal return type #541

KevinGe00 commented Oct 9, 2024 •

edited

Loading

rzhang10 commented Oct 9, 2024 •

edited

Loading

KevinGe00 commented Oct 10, 2024 •

edited

Loading

rzhang10 commented Oct 11, 2024

Register "passthrough" UDFs with correct ordinal return type #541

Register "passthrough" UDFs with correct ordinal return type #541

Conversation

KevinGe00 commented Oct 9, 2024 • edited Loading

What changes are proposed in this pull request, and why are they necessary?

How was this patch tested?

rzhang10 commented Oct 9, 2024 • edited Loading

KevinGe00 commented Oct 10, 2024 • edited Loading

rzhang10 commented Oct 11, 2024

KevinGe00 commented Oct 9, 2024 •

edited

Loading

rzhang10 commented Oct 9, 2024 •

edited

Loading

KevinGe00 commented Oct 10, 2024 •

edited

Loading