Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle single type uniontypes in Coral #504

Closed
wants to merge 17 commits into from

Conversation

KevinGe00
Copy link
Contributor

@KevinGe00 KevinGe00 commented May 4, 2024

Intro

The goal of the PR is to fix the unaccounted for side effects of single uniontype (a uniontype holding only one data type such as uniontype<array<string>>) related changes from #409.

Example 1 revealing issues surrounding view text translations on field references for extractions on single union datatypes:

Necesary Context: There is a Spark-specific mechanism that unwraps the uniontype to just the single underlying data type when reading from avro schemas. Reference.

Behavior before this PR:

Hive

Table t1 that has the columns:
union_col           	uniontype<array<string>>
struct_col           	struct<union_field:uniontype<array<string>>>

View v1: select extract_union(union_col).tag_0 as a, extract_union(struct_col).union_field.tag_0 as b from t1
a           	array<string>
b              array<string>

Both columns are trying to extract out the underlying field in the single uniontypes.

Spark
View v1 translated to:

select coalesce_struct(union_col) a, coalesce_struct(struct_col).union_field b from t1

Trino
View v1 translated to:

select extract_union(union_col) as a, extract_union(struct_col).union_field as b from t1

Issues:

  1. In Trino, single uniontypes like union_col are not unwrapped like it is in spark, so we need the .tag_0 reference to extract out the underlying datatype.

  2. The contract of coalesce_struct (spark equivalent of Hive's extract_union) states that it must only return structs. However, since Spark will map a single type uniontype to the underlying datatype, what coalesce_struct(union_col) is actually doing is taking in an array<string> and returning the same thing. This was a known issue reported in Extract_union doesnt return struct #419 after PR Generate Coral's RelNode for views from base table schema #409. extract_union(struct_col).union_field is as wanted since union_field won't need the tag_0 field reference for the same reasons.

Behavior after this PR:

Hive portion remains the same as before

Spark
View v1 translated to:

select union_col a, coalesce_struct(struct_col).union_field b from t1

Trino
View v1 translated to:

select extract_union(union_col).tag_0 as a, extract_union(struct_col).union_field.tag_0 as b from t1

Solution to issues:

  1. Move SingleUnionFieldReferenceTransformer to Spark RHS from Hive LHS as unwrapping single uniontypes happens only in Spark. Note, this required us to bring in a Rel datatype derivation on Spark RHS to detect such cases of single uniontypes.

  2. Fix SingleUnionFieldReferenceTransformer to translate extract_union(union_col).tag_0 in Hive to just union_col only for Spark.

Example 2 revealing issues surrounding schema type derivations on single uniontype:

Behavior before this PR:

Hive

Table t1 that has the columns:
struct_col           	struct<union_field:uniontype<array<string>>>

View v1: select struct_col as struct_col from t1
struct_col           	struct<union_field:uniontype<array<string>>>

struct_col column Rel datatype in v1 RelNode

RecordType(
  RecordType:peek_no_expand(
    VARCHAR(2147483647) ARRAY union_field, 
    ) 
    struct_col)

Issue: Because of this change to surface the underlying field & its schema in Coral's SqlNode RelNode representation when field is single uniontype, the Rel type of struct_col is simply struct<union_field:array<string>> instead of struct<union_field:uniontype<array<string>>>. This change was intended to fix the issue where the Avro schema (generated from RelNode using coral-schema) could not be analyzed by the Spark engine, since the single uniontype was not properly unwrapped. However, Coral generates Trino schemas also using RelNode, and Trino now cannot analyze the schema. We have a type coercion problem where Trino cannot coerce a col of type: row(union_field row(tag tinyint, field0 array(varchar))) (a struct representing a single uniontype) to the type of the column stored in the view definition (generated from Coral's RelNode with the uniontype unwrapped) row(union_field array(varchar)).

Behavior after this PR:

Hive portion remains the same as before

struct_col column Rel datatype in v1 RelNode

RecordType(
  RecordType:peek_no_expand(
       RecordType:peek_no_expand(INTEGER tag, VARCHAR(2147483647) ARRAY field0) union_field) 
  struct_col)

Solution to issue:
Since a Spark-only type derivation case (unwrap single uniontype) was introduced in the LHS, it must be taken out and re-introduced in a Spark-only type derivation path.

Trino schema derivation: RelNode -> Schema in Trino
Spark schema derivation: RelNode -(handle special spark typing here)-> Avro -> Schema in Spark

We now detect for single uniontypes in RelDataTypeToAvroType and do the unwrapping there.

How was this patch tested?

Unit tests:

  • CoralSparkTest.testUnionExtractUDFOnSingleTypeUnions to test Hive to Spark SQL on single union extract field references
  • ViewToAvroSchemaConverterTests.testSingleUnionFieldReference to to test Hive to Spark/Avro schema on single union extract field references
  • HiveToRelConverterTest.testSingleTypeUnion and HiveTableTest.testTableWithSingleUnion to test Hive SQL to RelNode does not unwrap single uniontypes as Trino requires this behavior
  • regression test results as expected, for example coalesce_struct(single_union) to just single_union, which is how Extract_union doesnt return struct #419 is fixed

@KevinGe00 KevinGe00 changed the title Correctly handle single type uniontypes in Coral [WIP Correctly handle single type uniontypes in Coral May 6, 2024
@KevinGe00 KevinGe00 changed the title [WIP Correctly handle single type uniontypes in Coral [WIP] Correctly handle single type uniontypes in Coral May 6, 2024
@wmoustafa
Copy link
Contributor

Thanks for the extensive description. Something that would help is to state the PR in terms of behavior before and after the PR, through minimal examples. You could state this using Hive, Spark, Trino SQL output and schema output (and maybe Coral schema in this case). Just giving this factual information upfront, helps understand the "What" first. I think the rest of the description discusses the "How" and "Why", but it is always better to be upfront about "What". Can you refactor the description to this format?

@KevinGe00
Copy link
Contributor Author

Thanks for the extensive description. Something that would help is to state the PR in terms of behavior before and after the PR, through minimal examples. You could state this using Hive, Spark, Trino SQL output and schema output (and maybe Coral schema in this case). Just giving this factual information upfront, helps understand the "What" first. I think the rest of the description discusses the "How" and "Why", but it is always better to be upfront about "What". Can you refactor the description to this format?

Thanks for the feedback, I've rewritten the description in accordance.

@KevinGe00 KevinGe00 changed the title [WIP] Correctly handle single type uniontypes in Coral Correctly handle single type uniontypes in Coral May 8, 2024
@wmoustafa
Copy link
Contributor

Could you clarify what this PR transforms just extract_union(union_col) to in case of Hive to Spark and single union?

@KevinGe00
Copy link
Contributor Author

Could you clarify what this PR transforms just extract_union(union_col) to in case of Hive to Spark and single union?

There is no transformation, since the assumption is that extract_union will always be called with a field reference such as tag_0. However, I realize now that is not an assumption we can strictly make. Instead we can translate Hive extract_union(union_col) to just union_col in Spark.

@KevinGe00
Copy link
Contributor Author

Close this PR as we have #507

@KevinGe00 KevinGe00 closed this May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants