Correctly handle single type uniontypes in Coral #507

KevinGe00 · 2024-05-29T22:27:36Z

Intro

The goal of this PR is to fix the unaccounted-for side effects of the single-uniontype changes from #409. A single uniontype is a uniontype holding only one data type, such as uniontype<array<string>>.
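To make the definition concrete, here is a minimal sketch that detects a single uniontype anywhere inside a Hive type string. It is not Coral code: the class `SingleUnionDetector` is hypothetical, and although the method name mirrors `containsSingleUnionType` from this PR, the real method works on a RelDataType, whereas this one assumes a well-formed, lowercase Hive type string.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Coral): inspects Hive type strings. */
final class SingleUnionDetector {

  /** Splits "a,b<c,d>,e" into top-level members, ignoring commas nested inside <...>. */
  static List<String> splitTopLevel(String s) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '<') {
        depth++;
      } else if (c == '>') {
        depth--;
      } else if (c == ',' && depth == 0) {
        parts.add(s.substring(start, i).trim());
        start = i + 1;
      }
    }
    parts.add(s.substring(start).trim());
    return parts;
  }

  /** Returns the index of the '>' that closes the '<' at position open. */
  static int matchingClose(String s, int open) {
    int depth = 0;
    for (int i = open; i < s.length(); i++) {
      if (s.charAt(i) == '<') {
        depth++;
      } else if (s.charAt(i) == '>' && --depth == 0) {
        return i;
      }
    }
    throw new IllegalArgumentException("Unbalanced type string: " + s);
  }

  /** Returns true if any uniontype<...> in the type string has exactly one member. */
  static boolean containsSingleUnionType(String hiveType) {
    String marker = "uniontype<";
    int from = 0;
    while (true) {
      int idx = hiveType.indexOf(marker, from);
      if (idx < 0) {
        return false;
      }
      int open = idx + marker.length() - 1; // position of '<'
      int close = matchingClose(hiveType, open);
      if (splitTopLevel(hiveType.substring(open + 1, close)).size() == 1) {
        return true;
      }
      from = open + 1; // keep scanning, including unions nested inside this one
    }
  }

  public static void main(String[] args) {
    System.out.println(containsSingleUnionType("uniontype<array<string>>"));
    System.out.println(containsSingleUnionType("uniontype<int,string>"));
  }
}
```

Both `uniontype<string>` and `struct<union_field:uniontype<string>>` from the examples below count as containing a single uniontype, while `uniontype<int,string>` does not.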

Example 1 revealing issues surrounding view text translations on field references for extractions on single union datatypes:

Necessary context: There is a Spark-specific mechanism that unwraps a uniontype to its single underlying data type when reading from Avro schemas. Reference.

Behavior before this PR:

Hive

Table t1 that has the columns:
union_col           	uniontype<string>
struct_col           	struct<union_field:uniontype<string>>

View v1: select extract_union(union_col).tag_0 as a, extract_union(struct_col).union_field.tag_0 as b from t1
a: string
b: string

Both columns extract out the underlying field in the single uniontypes.

Spark
View v1 translated to:

select coalesce_struct(union_col) a, coalesce_struct(struct_col).union_field b from t1

Trino
View v1 translated to:

select extract_union(union_col) as a, extract_union(struct_col).union_field as b from t1

Issues:

1. In Trino, single uniontypes like union_col are not unwrapped as they are in Spark, so extract_union(union_col) produces struct<tag_0:string>, and we need the .tag_0 reference to extract the underlying data type.
2. We shouldn't take two operators (the extract_union UDF call plus the field reference) and club them into a single identifier, as that is not modular enough. We cannot assume extract_union is never called on its own, so extract_union on single unions needs to work on its own.
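As an illustration of why the `.tag_0` reference is still needed in Trino, the sketch below derives the struct type that extract_union yields for a top-level uniontype. The single-member case, struct<tag_0:string>, comes from this description; the assumption is that the same tag_N naming extends to every member. `ExtractUnionShape` and its method are hypothetical names, not Coral or Trino APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (not Coral code) of the struct shape produced by extract_union. */
final class ExtractUnionShape {

  /** Maps "uniontype<T0,T1,...>" to "struct<tag_0:T0,tag_1:T1,...>". */
  static String extractUnionResultType(String unionType) {
    String prefix = "uniontype<";
    if (!unionType.startsWith(prefix) || !unionType.endsWith(">")) {
      throw new IllegalArgumentException("Not a uniontype: " + unionType);
    }
    String inner = unionType.substring(prefix.length(), unionType.length() - 1);
    List<String> fields = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i <= inner.length(); i++) {
      // Treat the end of the string as a final separator so the last member is emitted too.
      char c = (i == inner.length()) ? ',' : inner.charAt(i);
      if (c == '<') {
        depth++;
      } else if (c == '>') {
        depth--;
      } else if (c == ',' && depth == 0) {
        fields.add("tag_" + fields.size() + ":" + inner.substring(start, i).trim());
        start = i + 1;
      }
    }
    return "struct<" + String.join(",", fields) + ">";
  }

  public static void main(String[] args) {
    System.out.println(extractUnionResultType("uniontype<string>"));
  }
}
```

For `uniontype<string>` the result is a one-field struct, so the caller still has to dereference `tag_0` to reach the string.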

Behavior after this PR:

Hive portion remains the same as before

Spark
View v1 translated to:

select coalesce_struct(union_col, 'uniontype<string>').tag_0, (coalesce_struct(struct_col, 'struct<union_field:uniontype<string>>').union_field).tag_0 b from t1

Trino
View v1 translated to:

select extract_union(union_col).tag_0 as a, extract_union(struct_col).union_field.tag_0 as b from t1

Solution to issues:

1. Remove SingleUnionFieldReferenceTransformer, which was in Hive LHS, entirely (we no longer need it anyway based on (2)). The Trino translation path for extract_union on single unions is now left untouched.
2. Add an optional second schema-string parameter to the coalesce_struct UDF. The schema string is generated and passed in ExtractUnionFunctionTransformer while we still have the context of the Hive type and any single uniontypes. The coalesce_struct UDF now knows which fields (nested or not) came from a Hive single uniontype and can coalesce them accordingly.
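The decision described above can be sketched on plain strings. This is only an illustration of the rule "pass the schema string only when the operand type contains a single uniontype"; the real ExtractUnionFunctionTransformer operates on SqlNodes and RelDataTypes, and the class and method names below are hypothetical.

```java
/** Hypothetical string-level sketch of the extract_union -> coalesce_struct rewrite. */
final class CoalesceStructCallSketch {

  /** Returns true if some uniontype<...> in the Hive type string has exactly one member. */
  static boolean hasSingleUnion(String hiveType) {
    int idx = -1;
    while ((idx = hiveType.indexOf("uniontype<", idx + 1)) >= 0) {
      int depth = 0;
      boolean topLevelComma = false;
      // Start at the '<' that opens this uniontype and scan to its matching '>'.
      for (int i = idx + "uniontype".length(); i < hiveType.length(); i++) {
        char c = hiveType.charAt(i);
        if (c == '<') {
          depth++;
        } else if (c == '>' && --depth == 0) {
          break;
        } else if (c == ',' && depth == 1) {
          topLevelComma = true; // a second member exists, so this union is not single
        }
      }
      if (!topLevelComma) {
        return true;
      }
    }
    return false;
  }

  /** Builds the Spark call text, adding the schema string only for single uniontypes. */
  static String toCoalesceStructCall(String operandSql, String hiveType) {
    if (hasSingleUnion(hiveType)) {
      return "coalesce_struct(" + operandSql + ", '" + hiveType + "')";
    }
    return "coalesce_struct(" + operandSql + ")";
  }

  public static void main(String[] args) {
    System.out.println(toCoalesceStructCall("union_col", "uniontype<string>"));
    System.out.println(toCoalesceStructCall("u", "uniontype<int,string>"));
  }
}
```

With the columns from Example 1, this yields the same call shapes as the Spark translation shown above; operands without a single uniontype keep the one-argument form.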

Example 2 revealing issues surrounding schema type derivations for Trino single uniontype:

Behavior before this PR:

Hive

Table t1 that has the columns:
struct_col           	struct<union_field:uniontype<array<string>>>

View v1: select struct_col as struct_col from t1
struct_col           	struct<union_field:uniontype<array<string>>>

struct_col column Rel datatype in v1 RelNode

RecordType(
  RecordType:peek_no_expand(
    VARCHAR(2147483647) ARRAY union_field)
  struct_col)

Issue: Because of this change, which surfaces the underlying field and its schema in Coral's SqlNode and RelNode representations when a field is a single uniontype, the Rel type of struct_col is simply struct<union_field:array<string>> instead of struct<union_field:uniontype<array<string>>>. The change was intended to fix an issue where the Avro schema for extract_union calls on single unions (generated from the RelNode using coral-schema) could not be analyzed by the Spark engine.

However, Coral also generates Trino schemas from the RelNode, and Trino now cannot analyze the schema. There is a type coercion problem: Trino cannot coerce a column of type row(union_field row(tag tinyint, field0 array(varchar))) (a struct representing a single uniontype) to the column type stored in the view definition, row(union_field array(varchar)), which is generated from Coral's RelNode with the uniontype unwrapped.

Behavior after this PR:

Hive portion remains the same as before

struct_col column Rel datatype in v1 RelNode

RecordType(
  RecordType:peek_no_expand(
       RecordType:peek_no_expand(INTEGER tag, VARCHAR(2147483647) ARRAY field0) union_field) 
  struct_col)

Solution to issue: Remove unwrapping of single uniontypes of RelNodes in Hive LHS
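The mismatch can be reproduced on type strings alone. The sketch below renders a Hive type string as a Trino-style row type, either keeping single uniontypes as row(tag tinyint, field0 ...) or unwrapping them as the RelNode did before this PR. It is a simplified, hypothetical converter (not Coral's): it only understands struct, array, uniontype, and string, and passes every other type through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified Hive-to-Trino type renderer used only to show the mismatch. */
final class HiveToTrinoTypeSketch {

  static String toTrino(String hiveType, boolean unwrapSingleUnions) {
    String t = hiveType.trim();
    if (t.startsWith("array<")) {
      return "array(" + toTrino(inner(t), unwrapSingleUnions) + ")";
    }
    if (t.startsWith("struct<")) {
      List<String> out = new ArrayList<>();
      for (String field : splitTopLevel(inner(t))) {
        int colon = field.indexOf(':'); // the first colon separates field name from field type
        out.add(field.substring(0, colon) + " " + toTrino(field.substring(colon + 1), unwrapSingleUnions));
      }
      return "row(" + String.join(", ", out) + ")";
    }
    if (t.startsWith("uniontype<")) {
      List<String> members = splitTopLevel(inner(t));
      if (unwrapSingleUnions && members.size() == 1) {
        return toTrino(members.get(0), true); // pre-PR behavior: surface the underlying type
      }
      List<String> out = new ArrayList<>();
      out.add("tag tinyint");
      for (int i = 0; i < members.size(); i++) {
        out.add("field" + i + " " + toTrino(members.get(i), unwrapSingleUnions));
      }
      return "row(" + String.join(", ", out) + ")";
    }
    return t.equals("string") ? "varchar" : t; // other primitives are left as-is in this sketch
  }

  /** Returns the text between the first '<' and the final '>'. */
  static String inner(String t) {
    return t.substring(t.indexOf('<') + 1, t.length() - 1);
  }

  /** Splits on commas that are not nested inside <...>. */
  static List<String> splitTopLevel(String s) {
    List<String> parts = new ArrayList<>();
    int depth = 0;
    int start = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '<') {
        depth++;
      } else if (c == '>') {
        depth--;
      } else if (c == ',' && depth == 0) {
        parts.add(s.substring(start, i).trim());
        start = i + 1;
      }
    }
    parts.add(s.substring(start).trim());
    return parts;
  }

  public static void main(String[] args) {
    String hive = "struct<union_field:uniontype<array<string>>>";
    System.out.println(toTrino(hive, false));
    System.out.println(toTrino(hive, true));
  }
}
```

With unwrapping disabled, struct_col renders as the row type Trino reports for the column; with unwrapping enabled, it renders as the type that ended up in the view definition, and the two cannot be coerced into each other.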

Note that this change doesn't break Spark's type derivation for single uniontypes at LinkedIn. There is no regression in the schema produced for a column that is a single uniontype in Hive; this was verified by creating a view with a single uniontype column and describing it in the Spark shell with the changes from this PR.

How was this patch tested?

Regression test results for Trino, Spark, and Avro are all as expected
Unit tests
Tested these changes with the new coalesce_struct UDF changes to ensure the entire UX is as expected

wmoustafa · 2024-07-17T00:44:16Z

...mmon/src/main/java/com/linkedin/coral/common/utils/RelDataTypeToHiveTypeStringConverter.java

@@ -41,6 +41,12 @@ public class RelDataTypeToHiveTypeStringConverter {
  private RelDataTypeToHiveTypeStringConverter() {
  }

+  public RelDataTypeToHiveTypeStringConverter(boolean convertUnionTypes) {


Please add Java doc on what the effect of the parameter is.

wmoustafa · 2024-07-17T00:45:16Z

...mmon/src/main/java/com/linkedin/coral/common/utils/RelDataTypeToHiveTypeStringConverter.java

+    // Convert single uniontypes back to Hive representation so coalesce_struct UDF can handle
+    // single uniontypes in Spark correctly


Description/code in coral-common should not be custom to Spark or at least reference Spark, when it can be generic.

Removed reference.

wmoustafa · 2024-07-17T01:11:40Z

...ark/src/main/java/com/linkedin/coral/spark/transformers/ExtractUnionFunctionTransformer.java

+
+      if (containsSingleUnionType(operandType)) {
+        // Pass in schema string to keep track of the original Hive schema containing single uniontypes so coalesce_struct


Expand the comment to explain why it is important for coalesce_struct to know this distinction.

wmoustafa · 2024-07-17T01:13:31Z

...ark/src/main/java/com/linkedin/coral/spark/transformers/ExtractUnionFunctionTransformer.java

+ *
+ * Note that uniontypes holding a single need to be handled specially in Spark as there is a Spark-specific mechanism
+ * that unwraps a single uniontype (a uniontype holding only one data type) to just the single underlying data type.
+ * Reference: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion
+ *
+ * Check `CoralSparkTest#testUnionExtractUDFOnSingleTypeUnions` for examples.
+ *


This should state that the behavior is specific to base tables. Giving an example query helps. In fact, you might write a few illustrating different expected behaviors of this transformer.

> This should state that the behavior is specific to base tables.

I'm not sure this unwrapping behavior is specific to base tables. I created a Hive base table with a single union column, then created a view that selects all on the base table. Running desc on both the table and the view in Spark yields the same schema (the single union is unwrapped).

> Giving an example query helps

On line 47, the comment points to CoralSparkTest.testUnionExtractUDFOnSingleTypeUnions, which has examples covering simple and complex cases of Hive queries that would induce this transformer, along with the expected behavior.

> I'm not sure this unwrapping behavior is specific to base tables. I created a Hive base table with a single union column, then created a view that selects all on the base table. Running desc on both the table and the view in Spark yields the same schema (the single union is unwrapped).

I meant explaining that the main reason for unwrapping lies in base tables. The current description is a bit vague and may not help with the thought process.

Stating examples in the comments helps vs pointing to the test cases.

@wmoustafa Updated this piece of javadoc to include end2end reasoning + examples in comments. PTAL

wmoustafa · 2024-07-17T01:20:09Z

coral-spark/src/main/java/com/linkedin/coral/spark/DataTypeDerivedSqlCallConverter.java

+/**
+ * DataTypeDerivedSqlCallConverter transforms the sqlCalls
+ * in the input SqlNode representation to be compatible with Trino engine.
+ * The transformation may involve change in operator, reordering the operands
+ * or even re-constructing the SqlNode.
+ *
+ * All the transformations performed as part of this shuttle require RelDataType derivation.
+ */


This description is confusing (also class name is confusing).

Why would a spark Shuttle do something specific to Trino?

The description sounds too generic and arbitrary without specific contract.

This class is a copy of DataTypeDerivedSqlCallConverter in rel2trino. I could create a follow up PR to edit the name and javadoc for both DataTypeDerivedSqlCallConverter classes?

> Why would a spark Shuttle do something specific to Trino?

Forgot to change "Trino engine." to "Spark engine." after copying, fixed.

wmoustafa · 2024-07-17T01:21:31Z

coral-spark/src/main/java/com/linkedin/coral/spark/CoralSpark.java

    return sparkSqlNode.accept(new SparkSqlRewriter());
  }

  public static String constructSparkSQL(SqlNode sparkSqlNode) {
    return sparkSqlNode.toSqlString(SparkSqlDialect.INSTANCE).getSql();
  }

+  private static boolean isSelectStar(SqlNode node) {


What does this function have to do with single unions?

My apologies for not explaining earlier. After adding type derivation transformations to the coral-spark RHS, there was a side effect where the old detection for select star queries no longer worked, requiring us to update the detection logic. This is a more robust check anyhow.

wmoustafa · 2024-07-17T01:24:43Z

coral-spark/src/main/java/com/linkedin/coral/spark/CoralSpark.java

-        .accept(new CoralToSparkSqlCallConverter(sparkUDFInfos));
+
+    SqlNode coralSqlNodeWithRelDataTypeDerivedConversions =
+        coralSqlNode.accept(new DataTypeDerivedSqlCallConverter(hmsClient, coralSqlNode, sparkUDFInfos));


Did not follow the intuition behind this (specifically in the context of single union types).

In the coral-spark RHS (the RelNode -> SqlNode translation layer), there was previously no need for transformers that require rel type derivation. Now, however, we need type derivation in ExtractUnionFunctionTransformer (which lives in the coral-spark RHS) to detect extract_union calls on single uniontypes. Only when we detect such a call do we pass in the schema string when transforming the extract_union call to coalesce_struct.

Incidentally, this is a needed step for the Coral IR upgrades (Introduce API to enable type derivation in the SqlNode transformation layer), which we have now validated is doable. cc: @aastha25

I think we need a discussion around this. Our objective here is to standardize SqlNode to SqlNode conversions to happen strictly through SqlCallTransformers. We should discuss if this API is sufficient and if not, how to organize/standardize things that happen outside it. Objective is to minimize ad hoc transformations, and this seems to add an ad hoc transformation.

Discussed offline that we organize CoralRelNode -> LanguageSqlNode as 3 logical steps:

1. Apply SqlShuttle for list of SqlCall transformers that require rel type derivation
2. Apply SqlShuttle for list of SqlCall transformers that do not require rel type derivation
3. SqlNode transformations that cannot be done at the SqlCall layer (in coral-spark it is the CoralSqlNodeToSparkSqlNodeConverter class)

Steps 1 and 2 must be kept separate and run in that order: intermixing type-derivation transformers with non-type-derivation transformers causes failures, because the latter introduce certain operators that the type derivation util cannot yet handle.

A future PR will be set up to refactor these steps into a well documented class that loops through the 3 SqlShuttles.
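The ordering contract can be sketched as a loop over three stages. This is only an assumption about what the future refactoring might look like: strings stand in for SqlNodes, `UnaryOperator<String>` stands in for a SqlShuttle, and `ThreeStepPipelineSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of running the three logical steps strictly in order. */
final class ThreeStepPipelineSketch {

  /** Applies every transformer of stage 1, then stage 2, then stage 3. */
  static String run(String node, List<List<UnaryOperator<String>>> stages) {
    String current = node;
    for (List<UnaryOperator<String>> stage : stages) {
      for (UnaryOperator<String> transformer : stage) {
        current = transformer.apply(current);
      }
    }
    return current;
  }

  /** Builds a toy pipeline whose transformers only record that they ran. */
  static String demo() {
    UnaryOperator<String> typeDerived = s -> s + " [type-derived]"; // step 1: needs rel type derivation
    UnaryOperator<String> plain = s -> s + " [plain]";              // step 2: no type derivation
    UnaryOperator<String> nodeLevel = s -> s + " [node-level]";     // step 3: beyond the SqlCall layer
    List<List<UnaryOperator<String>>> stages =
        Arrays.asList(Arrays.asList(typeDerived), Arrays.asList(plain), Arrays.asList(nodeLevel));
    return run("query", stages);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Keeping the stages as separate lists makes the order explicit, so a type-derivation transformer never sees operators introduced by a later stage.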

…ngConverter

* fix single uniontypes in Coral
* remove SingleUnionFieldReferenceTransformer
* remove field reference fix to put in separate PR
* spotless
* update comments
* fix comment + add single uniontype test for RelDataTypeToHiveTypeStringConverter
* spotless
* improve Javadoc for ExtractUnionFunctionTransformer
* use html brackets in javadoc

KevinGe00 added 2 commits May 29, 2024 14:54

fix single uniontypes in Coral

fe7754a

remove SingleUnionFieldReferenceTransformer

41accbf

KevinGe00 changed the title ~~Single unions~~ Correctly handle single type uniontypes in Coral May 29, 2024

KevinGe00 mentioned this pull request May 29, 2024

Correctly handle single type uniontypes in Coral #504

Closed

KevinGe00 mentioned this pull request Jun 11, 2024

[Coral-Schema] Fix incorrect type derivation for repeated field reference on UDF calls #510

Merged

KevinGe00 added 2 commits June 11, 2024 19:59

remove field reference fix to put in separate PR

5ddb3bc

spotless

7ef4103

wmoustafa reviewed Jul 17, 2024


KevinGe00 added 5 commits July 18, 2024 21:16

update comments

7dc0116

fix comment + add single uniontype test for RelDataTypeToHiveTypeStri…

a18673a

…ngConverter

spotless

8207003

improve Javadoc for ExtractUnionFunctionTransformer

7bf6702

use html brackets in javadoc

fb89e0f

wmoustafa approved these changes Jul 31, 2024


KevinGe00 merged commit 74c2ca8 into linkedin:master Jul 31, 2024
1 check passed

KevinGe00 mentioned this pull request Oct 30, 2024

[WIP DO NOT MERGE] Bump Calcite to preserve LATERALs in LATERAL UNNEST calls #542

Draft
