Correctly handle single type uniontypes in Coral #504
…ing new transformer
…columns are spark compliant
Thanks for the extensive description. Something that would help is to state the PR in terms of behavior before and after the PR, through minimal examples. You could state this using Hive, Spark, and Trino SQL output and schema output (and maybe the Coral schema in this case). Just giving this factual information upfront helps readers understand the "What" first. I think the rest of the description discusses the "How" and "Why", but it is always better to be upfront about the "What". Can you refactor the description to this format?
Thanks for the feedback, I've rewritten the description accordingly.
Could you clarify what this PR transforms just
There is no transformation, since the assumption is that
Closing this PR, as we have #507.
Intro
The goal of this PR is to fix the unaccounted-for side effects of the single uniontype related changes from #409. A single uniontype is a uniontype holding only one data type, such as uniontype<array<string>>.
Example 1 revealing issues surrounding view text translations on field references for extractions on single union datatypes:
Necessary context: There is a Spark-specific mechanism that unwraps the uniontype to just the single underlying data type when reading from Avro schemas. Reference.
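The unwrapping described above can be pictured with a small Python sketch. It is a simplified model and not Spark's actual code: schemas are plain Python values shaped like Avro JSON (a union is a list of branches), the function name unwrap_single_unions is hypothetical, and nullable unions are deliberately not modelled.

```python
# Simplified model of single-branch union unwrapping on Avro-style schemas.
# Assumption: a union is a Python list, an array is {"type": "array", "items": ...},
# and a record is {"type": "record", "fields": [{"name": ..., "type": ...}]}.

def unwrap_single_unions(schema):
    """Recursively replace every one-branch union with its only branch."""
    if isinstance(schema, list):
        branches = [unwrap_single_unions(b) for b in schema]
        # A union with exactly one branch collapses to that branch.
        return branches[0] if len(branches) == 1 else branches
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "array":
            return {**schema, "items": unwrap_single_unions(schema["items"])}
        if kind == "record":
            fields = [
                {**f, "type": unwrap_single_unions(f["type"])}
                for f in schema["fields"]
            ]
            return {**schema, "fields": fields}
    # Primitives and anything not modelled here are returned unchanged.
    return schema


single_union_col = [{"type": "array", "items": "string"}]  # uniontype<array<string>>
multi_union_col = ["int", "string"]                        # uniontype<int,string>

print(unwrap_single_unions(single_union_col))  # the array type, no union wrapper
print(unwrap_single_unions(multi_union_col))   # left as a two-branch union
```

Under this model the single uniontype column reads back as a plain array of strings, which is why field references such as .tag_0 have nothing to point at on the Spark side.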
Behavior before this PR:
Hive
Spark
View v1 translated to:
Trino
View v1 translated to:
Issues:
In Trino, single uniontypes like union_col are not unwrapped as they are in Spark, so we need the .tag_0 reference to extract the underlying datatype.
The contract of coalesce_struct (the Spark equivalent of Hive's extract_union) states that it must only return structs. However, since Spark maps a single-type uniontype to the underlying datatype, what coalesce_struct(union_col) is actually doing is taking in an array<string> and returning the same thing. This was a known issue reported in Extract_union doesnt return struct #419 after PR Generate Coral's RelNode for views from base table schema #409.
extract_union(struct_col).union_field is as wanted, since union_field won't need the tag_0 field reference for the same reasons.
Behavior after this PR:
Hive portion remains the same as before
Spark
View v1 translated to:
Trino
View v1 translated to:
Solution to issues:
Move SingleUnionFieldReferenceTransformer to the Spark RHS from the Hive LHS, as unwrapping single uniontypes happens only in Spark. Note that this required us to bring in a Rel datatype derivation on the Spark RHS to detect such cases of single uniontypes.
Fix SingleUnionFieldReferenceTransformer to translate extract_union(union_col).tag_0 in Hive to just union_col, only for Spark.
Example 2 revealing issues surrounding schema type derivations on single uniontype:
Behavior before this PR:
Hive
struct_col column Rel datatype in v1 RelNode
Issue: Because of this change to surface the underlying field and its schema in Coral's SqlNode RelNode representation when a field is a single uniontype, the Rel type of struct_col is simply struct<union_field:array<string>> instead of struct<union_field:uniontype<array<string>>>. This change was intended to fix the issue where the Avro schema (generated from RelNode using coral-schema) could not be analyzed by the Spark engine, since the single uniontype was not properly unwrapped. However, Coral also generates Trino schemas using RelNode, and Trino now cannot analyze the schema. We have a type coercion problem where Trino cannot coerce a column of type row(union_field row(tag tinyint, field0 array(varchar))) (a struct representing a single uniontype) to the type of the column stored in the view definition (generated from Coral's RelNode with the uniontype unwrapped), which is row(union_field array(varchar)).
Behavior after this PR:
Hive portion remains the same as before
struct_col column Rel datatype in v1 RelNode
Solution to issue:
Since a Spark-only type derivation case (unwrap single uniontype) was introduced in the LHS, it must be taken out and re-introduced in a Spark-only type derivation path.
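A minimal Python sketch of what an engine-specific derivation path means here, assuming a toy type model: the shared type keeps the single uniontype intact, the Trino rendering exposes it as a row with tag and field0, and only the Spark rendering unwraps it. All class and function names below are hypothetical and do not correspond to Coral's classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Prim:
    name: str


@dataclass(frozen=True)
class Array:
    element: object


@dataclass(frozen=True)
class UnionType:
    branches: Tuple[object, ...]


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Tuple[str, object], ...]


def to_trino(t):
    """Trino-side rendering: a uniontype stays a row(tag, field0, ...)."""
    if isinstance(t, Prim):
        return {"string": "varchar"}.get(t.name, t.name)
    if isinstance(t, Array):
        return f"array({to_trino(t.element)})"
    if isinstance(t, UnionType):
        members = ", ".join(
            f"field{i} {to_trino(b)}" for i, b in enumerate(t.branches)
        )
        return f"row(tag tinyint, {members})"
    if isinstance(t, Struct):
        return "row(" + ", ".join(f"{n} {to_trino(ft)}" for n, ft in t.fields) + ")"
    raise TypeError(f"unsupported type: {t!r}")


def to_spark(t):
    """Spark-side rendering: a single uniontype is unwrapped to its branch."""
    if isinstance(t, Prim):
        return t.name
    if isinstance(t, Array):
        return f"array<{to_spark(t.element)}>"
    if isinstance(t, UnionType):
        if len(t.branches) == 1:
            return to_spark(t.branches[0])
        raise NotImplementedError("multi-branch unions are outside this sketch")
    if isinstance(t, Struct):
        return "struct<" + ",".join(f"{n}:{to_spark(ft)}" for n, ft in t.fields) + ">"
    raise TypeError(f"unsupported type: {t!r}")


# struct<union_field:uniontype<array<string>>> in the shared representation
struct_col = Struct((("union_field", UnionType((Array(Prim("string")),))),))

print(to_trino(struct_col))
print(to_spark(struct_col))
```

The point of the design is that the unwrapping decision lives in the Spark renderer alone, so the Trino type still matches what Trino derives from the table.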
We now detect single uniontypes in RelDataTypeToAvroType and do the unwrapping there.
How was this patch tested?
Unit tests:
CoralSparkTest.testUnionExtractUDFOnSingleTypeUnions to test Hive to Spark SQL on single union extract field references
ViewToAvroSchemaConverterTests.testSingleUnionFieldReference to test Hive to Spark/Avro schema on single union extract field references
HiveToRelConverterTest.testSingleTypeUnion and HiveTableTest.testTableWithSingleUnion to test that Hive SQL to RelNode does not unwrap single uniontypes, as Trino requires this behavior
coalesce_struct(single_union) to just single_union, which is how Extract_union doesnt return struct #419 is fixed
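For reference, the Spark-only field-reference rewrite from Example 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the transformer's implementation: the expression classes and rewrite_for_spark are hypothetical, and the set of single uniontype column names stands in for the Rel datatype derivation that the PR uses to detect such columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    arg: object


@dataclass(frozen=True)
class FieldAccess:
    target: object
    field: str


def rewrite_for_spark(expr, single_union_columns):
    """Rewrite extract_union(col).tag_0 to col when col is a single uniontype."""
    if isinstance(expr, FieldAccess):
        target = expr.target
        if (
            expr.field == "tag_0"
            and isinstance(target, Call)
            and target.func == "extract_union"
            and isinstance(target.arg, Column)
            and target.arg.name in single_union_columns
        ):
            # Spark already sees the underlying type, so drop both wrappers.
            return target.arg
        return FieldAccess(rewrite_for_spark(target, single_union_columns), expr.field)
    if isinstance(expr, Call):
        return Call(expr.func, rewrite_for_spark(expr.arg, single_union_columns))
    return expr


def render(expr):
    """Print an expression back as SQL-like text."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Call):
        return f"{expr.func}({render(expr.arg)})"
    return f"{render(expr.target)}.{expr.field}"


hive_expr = FieldAccess(Call("extract_union", Column("union_col")), "tag_0")

print(render(rewrite_for_spark(hive_expr, {"union_col"})))  # single uniontype column
print(render(rewrite_for_spark(hive_expr, set())))          # not a single uniontype
```

Keeping the rewrite on the Spark side means the Trino translation still receives extract_union(union_col).tag_0 unchanged.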