Correctly handle single type uniontypes in Coral #507

Merged
merged 9 commits into from
Jul 31, 2024

Conversation

KevinGe00
Contributor

@KevinGe00 KevinGe00 commented May 29, 2024

Intro

The goal of this PR is to fix the unaccounted-for side effects of the single-uniontype changes from #409. A single uniontype is a uniontype holding only one data type, such as uniontype<array<string>>.

Example 1 revealing issues surrounding view text translations on field references for extractions on single union datatypes:

Necessary context: There is a Spark-specific mechanism that unwraps a single uniontype to just its underlying data type when reading from Avro schemas. Reference.

Behavior before this PR:

Hive

Table t1 that has the columns:
union_col           	uniontype<string>
struct_col           	struct<union_field:uniontype<string>>

View v1: select extract_union(union_col).tag_0 as a, extract_union(struct_col).union_field.tag_0 as b from t1
a: string
b: string

Both columns extract out the underlying field in the single uniontypes.

Spark
View v1 translated to:

select coalesce_struct(union_col) a, coalesce_struct(struct_col).union_field b from t1

Trino
View v1 translated to:

select extract_union(union_col) as a, extract_union(struct_col).union_field as b from t1

Issues:

  1. In Trino, single uniontypes like union_col are not unwrapped the way they are in Spark. extract_union(union_col) therefore produces struct<tag_0:string>, so the .tag_0 reference is still needed to extract the underlying data type.

  2. We shouldn’t take two operators (the extract_union UDF call plus the field reference) and club them into a single identifier, as that is not modular enough. We cannot assume extract_union is never called on its own, so extract_union on single unions needs to work on its own.
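
To make issue 1 concrete, here is a minimal sketch of the type that extract_union yields, modeled on Hive type strings. The class and method names are hypothetical and are not part of Coral. The single-member result (struct<tag_0:string>) is taken from the description above; the multi-member naming (tag_1, tag_2, ...) is an assumption made by extension.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: models the result type of extract_union on a Hive type string. */
final class ExtractUnionTypeSketch {

  /** Splits on commas that are not nested inside angle brackets. */
  static List<String> splitTopLevel(String s) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '<') {
        depth++;
      } else if (c == '>') {
        depth--;
      } else if (c == ',' && depth == 0) {
        parts.add(s.substring(start, i).trim());
        start = i + 1;
      }
    }
    parts.add(s.substring(start).trim());
    return parts;
  }

  /** uniontype<T0,T1,...> becomes struct<tag_0:T0,tag_1:T1,...>; other types pass through. */
  static String extractUnionType(String hiveType) {
    String prefix = "uniontype<";
    if (!hiveType.startsWith(prefix) || !hiveType.endsWith(">")) {
      return hiveType;
    }
    String inner = hiveType.substring(prefix.length(), hiveType.length() - 1);
    List<String> members = splitTopLevel(inner);
    StringBuilder sb = new StringBuilder("struct<");
    for (int i = 0; i < members.size(); i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append("tag_").append(i).append(':').append(members.get(i));
    }
    return sb.append('>').toString();
  }

  public static void main(String[] args) {
    // A single uniontype still yields a one-field struct, so .tag_0 is needed.
    System.out.println(extractUnionType("uniontype<string>"));
  }
}
```

Because the result is still a struct, an engine that does not unwrap single uniontypes needs the trailing .tag_0 to reach the underlying string.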

Behavior after this PR:

Hive portion remains the same as before

Spark
View v1 translated to:

select coalesce_struct(union_col, 'uniontype<string>').tag_0, (coalesce_struct(struct_col, 'struct<union_field:uniontype<string>>').union_field).tag_0 b from t1

Trino
View v1 translated to:

select extract_union(union_col).tag_0 as a, extract_union(struct_col).union_field.tag_0 as b from t1

Solution to issues:

  1. Remove SingleUnionFieldReferenceTransformer, which lived in the Hive LHS, entirely (it is no longer needed anyway given (2)). The Trino translation path for extract_union on single unions is now left untouched.

  2. Add a second, optional schema string parameter to the coalesce_struct UDF. The schema string is generated and passed in by ExtractUnionFunctionTransformer while we still have the context of the Hive type and any single uniontypes. The coalesce_struct UDF now knows which fields (nested or not) came from a Hive single uniontype and can coalesce them as such.
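
The decision in (2) can be sketched roughly as follows. This is an illustration only: the real transformer works on Calcite RelDataType objects, whereas this sketch inspects Hive type strings, and CoalesceStructCallSketch and toSparkCall are hypothetical names. Only containsSingleUnionType borrows its name from the diff, and its body here is an assumption.

```java
/** Hypothetical sketch: decides whether to pass the optional schema string to coalesce_struct. */
final class CoalesceStructCallSketch {

  /** Returns true if the Hive type string contains a uniontype with exactly one member type. */
  static boolean containsSingleUnionType(String hiveType) {
    final String marker = "uniontype<";
    int from = 0;
    while (true) {
      int start = hiveType.indexOf(marker, from);
      if (start < 0) {
        return false;
      }
      int depth = 1;          // nesting depth relative to this uniontype's '<'
      int topLevelCommas = 0; // commas separating this uniontype's own members
      for (int i = start + marker.length(); i < hiveType.length() && depth > 0; i++) {
        char c = hiveType.charAt(i);
        if (c == '<') {
          depth++;
        } else if (c == '>') {
          depth--;
        } else if (c == ',' && depth == 1) {
          topLevelCommas++;
        }
      }
      if (depth == 0 && topLevelCommas == 0) {
        return true;
      }
      from = start + marker.length(); // keep scanning for nested or later uniontypes
    }
  }

  /** Emits the one-argument form unless the operand type contains a single uniontype. */
  static String toSparkCall(String operand, String hiveType) {
    if (containsSingleUnionType(hiveType)) {
      return "coalesce_struct(" + operand + ", '" + hiveType + "')";
    }
    return "coalesce_struct(" + operand + ")";
  }

  public static void main(String[] args) {
    System.out.println(toSparkCall("union_col", "uniontype<string>"));
    System.out.println(toSparkCall("struct_col", "struct<union_field:uniontype<string>>"));
    System.out.println(toSparkCall("other_col", "uniontype<int,string>"));
  }
}
```

Passing the schema only when a single uniontype is present keeps the existing one-argument calls unchanged for all other columns.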

Example 2 revealing issues surrounding schema type derivations for Trino single uniontype:

Behavior before this PR:

Hive

Table t1 that has the columns:
struct_col           	struct<union_field:uniontype<array<string>>>

View v1: select struct_col as struct_col from t1
struct_col           	struct<union_field:uniontype<array<string>>>

struct_col column Rel datatype in v1 RelNode

RecordType(
  RecordType:peek_no_expand(
    VARCHAR(2147483647) ARRAY union_field)
  struct_col)

Issue: This change surfaces the underlying field and its schema in Coral's SqlNode/RelNode representation when a field is a single uniontype. As a result, the Rel type of struct_col is simply struct<union_field:array<string>> instead of struct<union_field:uniontype<array<string>>>. The change was intended to fix an issue where the Avro schema for extract_union calls on single unions (generated from RelNode using coral-schema) could not be analyzed by the Spark engine.

However, Coral also generates Trino schemas from RelNode, and Trino can no longer analyze the schema. This is a type coercion problem: Trino cannot coerce a column of type row(union_field row(tag tinyint, field0 array(varchar))) (a struct representing a single uniontype) to the type of the column stored in the view definition, row(union_field array(varchar)), which is generated from Coral's RelNode with the uniontype unwrapped.
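
The mismatch can be reproduced with a small sketch that renders a Hive type string as a Trino-style type, with and without unwrapping single uniontypes. This is an illustrative model, not Coral's or Trino's actual conversion code: TrinoTypeSketch and toTrino are hypothetical names, and only the string, array, struct and uniontype cases needed for this example are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: renders a Hive type string as a Trino-style type string. */
final class TrinoTypeSketch {

  /** Splits on commas that are not nested inside angle brackets. */
  static List<String> splitTopLevel(String s) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '<') {
        depth++;
      } else if (c == '>') {
        depth--;
      } else if (c == ',' && depth == 0) {
        parts.add(s.substring(start, i).trim());
        start = i + 1;
      }
    }
    parts.add(s.substring(start).trim());
    return parts;
  }

  /** Strips "prefix" and the trailing '>' from a parameterized type. */
  static String inner(String type, String prefix) {
    return type.substring(prefix.length(), type.length() - 1);
  }

  static String toTrino(String hiveType, boolean unwrapSingleUnions) {
    if (hiveType.startsWith("array<")) {
      return "array(" + toTrino(inner(hiveType, "array<"), unwrapSingleUnions) + ")";
    }
    if (hiveType.startsWith("struct<")) {
      List<String> fields = new ArrayList<>();
      for (String field : splitTopLevel(inner(hiveType, "struct<"))) {
        int colon = field.indexOf(':'); // the first colon separates name from type
        fields.add(field.substring(0, colon) + " "
            + toTrino(field.substring(colon + 1), unwrapSingleUnions));
      }
      return "row(" + String.join(", ", fields) + ")";
    }
    if (hiveType.startsWith("uniontype<")) {
      List<String> members = splitTopLevel(inner(hiveType, "uniontype<"));
      if (unwrapSingleUnions && members.size() == 1) {
        return toTrino(members.get(0), true); // pre-PR RelNode behavior
      }
      List<String> fields = new ArrayList<>();
      fields.add("tag tinyint");
      for (int i = 0; i < members.size(); i++) {
        fields.add("field" + i + " " + toTrino(members.get(i), unwrapSingleUnions));
      }
      return "row(" + String.join(", ", fields) + ")";
    }
    return hiveType.equals("string") ? "varchar" : hiveType;
  }

  public static void main(String[] args) {
    String hiveType = "struct<union_field:uniontype<array<string>>>";
    System.out.println("table column: " + toTrino(hiveType, false));
    System.out.println("view column:  " + toTrino(hiveType, true));
  }
}
```

The two printed types differ in shape, which is the coercion Trino cannot perform between the underlying column and the view definition.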

Behavior after this PR:

Hive portion remains the same as before

struct_col column Rel datatype in v1 RelNode

RecordType(
  RecordType:peek_no_expand(
       RecordType:peek_no_expand(INTEGER tag, VARCHAR(2147483647) ARRAY field0) union_field) 
  struct_col)

Solution to issue: Remove the unwrapping of single uniontypes in RelNodes in the Hive LHS.

Note that this change doesn't break Spark's type derivation for single uniontypes at LinkedIn. There is no regression in the schema produced for a column that is a single uniontype in Hive; this was verified by creating a view with a single uniontype column and describing it in the Spark shell with the changes from this PR.

How was this patch tested?

  • Regression test results for Trino, Spark and Avro are all as expected
  • Unit tests
  • Tested these changes with the new coalesce_struct UDF changes to ensure the entire UX is as expected

@KevinGe00 KevinGe00 changed the title from "Single unions" to "Correctly handle single type uniontypes in Coral" May 29, 2024
@@ -41,6 +41,12 @@ public class RelDataTypeToHiveTypeStringConverter {
private RelDataTypeToHiveTypeStringConverter() {
}

public RelDataTypeToHiveTypeStringConverter(boolean convertUnionTypes) {
Contributor

Please add Javadoc describing the effect of the parameter.

Contributor Author

@KevinGe00 KevinGe00 Jul 19, 2024

Comment on lines 120 to 121
// Convert single uniontypes back to Hive representation so coalesce_struct UDF can handle
// single uniontypes in Spark correctly
Contributor

Description/code in coral-common should not be custom to Spark or at least reference Spark, when it can be generic.

Contributor Author

Removed reference.
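
For readers following this thread, here is a rough sketch of what a convertUnionTypes flag could control when printing a struct as a Hive type string. It is an assumption based on the code comment quoted above ("Convert single uniontypes back to Hive representation"), not the actual RelDataTypeToHiveTypeStringConverter logic. The class name, the map-based struct model, and the tag/field0 shape check are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: a flag that turns union-shaped structs back into uniontype strings. */
final class UnionAwareStructPrinterSketch {

  private final boolean convertUnionTypes;

  UnionAwareStructPrinterSketch(boolean convertUnionTypes) {
    this.convertUnionTypes = convertUnionTypes;
  }

  /** Prints an ordered field-name-to-type map as a Hive type string. */
  String structToHiveString(Map<String, String> fields) {
    // Assumed shape of a single uniontype after conversion to a struct: (tag, field0).
    boolean looksLikeSingleUnion =
        fields.size() == 2 && fields.containsKey("tag") && fields.containsKey("field0");
    if (convertUnionTypes && looksLikeSingleUnion) {
      return "uniontype<" + fields.get("field0") + ">";
    }
    StringJoiner joiner = new StringJoiner(",", "struct<", ">");
    for (Map.Entry<String, String> field : fields.entrySet()) {
      joiner.add(field.getKey() + ":" + field.getValue());
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("tag", "int");
    fields.put("field0", "string");
    System.out.println(new UnionAwareStructPrinterSketch(true).structToHiveString(fields));
    System.out.println(new UnionAwareStructPrinterSketch(false).structToHiveString(fields));
  }
}
```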

Comment on lines +79 to +81

if (containsSingleUnionType(operandType)) {
// Pass in schema string to keep track of the original Hive schema containing single uniontypes so coalesce_struct
Contributor

Expand the comment to explain why it is important for coalesce_struct to know this distinction.

Comment on lines 42 to 48
*
* Note that uniontypes holding a single type need to be handled specially in Spark as there is a Spark-specific mechanism
* that unwraps a single uniontype (a uniontype holding only one data type) to just the single underlying data type.
* Reference: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion
*
* Check `CoralSparkTest#testUnionExtractUDFOnSingleTypeUnions` for examples.
*
Contributor

This should state that the behavior is specific to base tables. Giving an example query helps. In fact, you might write a few illustrating different expected behaviors of this transformer.

Contributor Author

"This should state that the behavior is specific to base tables."

I'm not sure this unwrapping behavior is specific to base tables. I created a Hive base table with a single union column, then created a view that selects everything from the base table. Running desc on both the table and the view in Spark yields the same schema (the single union is unwrapped).

"Giving an example query helps."

On line 47, the comment points to CoralSparkTest.testUnionExtractUDFOnSingleTypeUnions, which has examples covering simple and complex cases of Hive queries that would invoke this transformer, along with the expected behavior.

Contributor

"I'm not sure this unwrapping behavior is specific to base tables. I created a Hive base table with a single union column, then created a view that selects everything from the base table. Running desc on both the table and the view in Spark yields the same schema (the single union is unwrapped)."

I meant explaining that the main reason for unwrapping lies in base tables. The current description is a bit vague and may not help with the thought process.

Stating examples in the comments helps more than pointing to the test cases.

Contributor Author

@KevinGe00 KevinGe00 Jul 25, 2024

@wmoustafa Updated this piece of Javadoc to include end-to-end reasoning and examples in the comments. PTAL

Comment on lines 22 to 29
/**
* DataTypeDerivedSqlCallConverter transforms the sqlCalls
* in the input SqlNode representation to be compatible with Trino engine.
* The transformation may involve change in operator, reordering the operands
* or even re-constructing the SqlNode.
*
* All the transformations performed as part of this shuttle require RelDataType derivation.
*/
Contributor

This description is confusing (also class name is confusing).

Why would a spark Shuttle do something specific to Trino?

The description sounds too generic and arbitrary without specific contract.

Contributor Author

This class is a copy of DataTypeDerivedSqlCallConverter in rel2trino. I could create a follow-up PR to edit the name and Javadoc for both DataTypeDerivedSqlCallConverter classes?

"Why would a spark Shuttle do something specific to Trino?"

Forgot to change "Trino engine." to "Spark engine." after copying; fixed.

return sparkSqlNode.accept(new SparkSqlRewriter());
}

public static String constructSparkSQL(SqlNode sparkSqlNode) {
return sparkSqlNode.toSqlString(SparkSqlDialect.INSTANCE).getSql();
}

private static boolean isSelectStar(SqlNode node) {
Contributor

What does this function have to do with single unions?

Contributor Author

My apologies for not explaining earlier. After adding type derivation transformations to the coral-spark RHS, there was a side effect: the old detection for select star queries no longer worked, so we had to update the detection logic. This is a more robust check anyhow.

.accept(new CoralToSparkSqlCallConverter(sparkUDFInfos));

SqlNode coralSqlNodeWithRelDataTypeDerivedConversions =
coralSqlNode.accept(new DataTypeDerivedSqlCallConverter(hmsClient, coralSqlNode, sparkUDFInfos));
Contributor

Did not follow the intuition behind this (specifically in the context of single union types).

Contributor Author

@KevinGe00 KevinGe00 Jul 19, 2024

In the coral-spark RHS (the RelNode -> SqlNode translation layer), there was previously no need for transformers that require rel type derivation. Now, however, we need type derivation in ExtractUnionFunctionTransformer (which lives in the coral-spark RHS) to detect extract_union calls on single uniontypes. Only when we detect such a call do we pass in the schema string when transforming the extract_union call to coalesce_struct.

Contributor Author

@KevinGe00 KevinGe00 Jul 24, 2024

Incidentally, this is a needed step for the Coral IR upgrades (Introduce API to enable type derivation in the SqlNode transformation layer), which we have now validated is doable. cc: @aastha25

Contributor

I think we need a discussion around this. Our objective here is to standardize SqlNode to SqlNode conversions to happen strictly through SqlCallTransformers. We should discuss if this API is sufficient and if not, how to organize/standardize things that happen outside it. Objective is to minimize ad hoc transformations, and this seems to add an ad hoc transformation.

Contributor Author

Discussed offline that we organize CoralRelNode -> LanguageSqlNode as 3 logical steps:

  1. Apply SqlShuttle for list of SqlCall transformers that require rel type derivation
  2. Apply SqlShuttle for list of SqlCall transformers that do not require rel type derivation
  3. SqlNode transformations that cannot be done at the SqlCall layer (in coral-spark it is the CoralSqlNodeToSparkSqlNodeConverter class)

Steps 1 and 2 must be kept separate, in that order. Intermixing transformers that require type derivation with those that do not causes failures, because the latter introduce certain operators that the type derivation util cannot yet handle.

A future PR will be set up to refactor these steps into a well documented class that loops through the 3 SqlShuttles.
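
As a rough illustration of the ordering agreed above, the sketch below applies three stages strictly in sequence. It is not the planned refactoring: ThreeStepConversionSketch is a hypothetical name, and the stages are stand-ins that operate on strings rather than on Calcite SqlNode objects, purely to make the order of application visible.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: applies the three logical conversion steps in a fixed order. */
final class ThreeStepConversionSketch {

  /** Applies each step to the output of the previous one. */
  static <T> T applyInOrder(T node, List<UnaryOperator<T>> steps) {
    T current = node;
    for (UnaryOperator<T> step : steps) {
      current = step.apply(current);
    }
    return current;
  }

  /** Stand-in stages that only record their name, to show the order of application. */
  static String convert(String coralNode) {
    // Step 1: SqlCall transformers that require rel type derivation.
    UnaryOperator<String> typeDerived = n -> n + " > typeDerivedTransformers";
    // Step 2: SqlCall transformers that do not require rel type derivation.
    UnaryOperator<String> plain = n -> n + " > plainTransformers";
    // Step 3: SqlNode-level rewrites that cannot be done at the SqlCall layer.
    UnaryOperator<String> sqlNodeLevel = n -> n + " > sqlNodeLevelRewrites";
    return applyInOrder(coralNode, Arrays.asList(typeDerived, plain, sqlNodeLevel));
  }

  public static void main(String[] args) {
    System.out.println(convert("coralSqlNode"));
  }
}
```

Keeping the steps in one ordered list makes the constraint explicit: the type-derived stage always sees a tree that the later stages have not yet modified.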

@KevinGe00 KevinGe00 merged commit 74c2ca8 into linkedin:master Jul 31, 2024
1 check passed
KevinGe00 added a commit to KevinGe00/coral that referenced this pull request Jul 31, 2024
* fix single uniontypes in Coral

* remove SingleUnionFieldReferenceTransformer

* remove field reference fix to put in separate PR

* spotless

* update comments

* fix comment + add single uniontype test for RelDataTypeToHiveTypeStringConverter

* spotless

* improve Javadoc for ExtractUnionFunctionTransformer

* use html brackets in javadoc