Add missing numeric precisions in Hive type system #501
Conversation
Should the precision be patched in Coral or Calcite? Should it go in LHS or IR? Sounds like Calcite + IR is the correct answer based on recent discussions?
There is no error in Calcite code/precision. What's happening is that Calcite uses whatever implementation of the interface
When you say IR, do you mean that we need to align on a unified Coral type system, whether that's using ...? I may be misunderstanding what you mean by IR here, but an IR type system doesn't make sense to me. We need type systems to encapsulate differences in type limits and behaviors. For example, in the default SQL system a DECIMAL can have maximum precision 19, but Hive overrides that to 38. We would need this information when converting a Hive input query.
Ok, let me clarify. For Coral vs Calcite: I was under the impression that Hive INT and LONG would be converted to Calcite INT and LONG, and then Calcite would be responsible for internally figuring out each one's precision. Looking at the class contract, it seems that it does expect us to fill in the precision as part of the contract, not simply return INT or LONG and leave it at that. So now it makes sense that we are obligated to fill in this value as part of the contract. For the second part of the question, what I meant by an IR type system is the "Coral type system" that all other systems get mapped to (e.g., from LHS). Upon further look, it sounds like this is indeed a Coral type system and not a Hive-specific type system, and the class name is just misleading for historical reasons. Why is it a Coral type system: because it is in the
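The contract point above can be illustrated with a small stdlib-only toy model (this is NOT Calcite's real `RelDataTypeSystem` interface; all names here are illustrative): the type system must report a default precision per type, and consumers rely on those precisions to order types.

```java
import java.util.Map;

// Toy stand-in for a type system whose contract includes reporting a
// default precision per type name. Without the precisions filled in,
// the leastRestrictive logic below could not decide between types.
class ToyTypeSystem {
    private static final Map<String, Integer> DEFAULT_PRECISION = Map.of(
            "TINYINT", 3, "SMALLINT", 5, "INTEGER", 10, "BIGINT", 19);

    int getDefaultPrecision(String typeName) {
        Integer p = DEFAULT_PRECISION.get(typeName);
        if (p == null) {
            throw new IllegalArgumentException("unknown type: " + typeName);
        }
        return p;
    }

    // The wider (higher-precision) type wins, mimicking how Calcite
    // picks the least restrictive type once precisions are available.
    String leastRestrictive(String a, String b) {
        return getDefaultPrecision(a) >= getDefaultPrecision(b) ? a : b;
    }
}
```

With this model, `leastRestrictive("INTEGER", "BIGINT")` yields `BIGINT`, which is the behavior the PR restores.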
Thanks for the PR @KevinGe00!
has this patch been tested against the li-trino-hotfix branch? could you add that in the PR description?
private static final int DEFAULT_TINYINT_PRECISION = 3;
private static final int DEFAULT_SMALLINT_PRECISION = 5;
private static final int DEFAULT_INTEGER_PRECISION = 10;
private static final int DEFAULT_BIGINT_PRECISION = 19;
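These constants line up with the number of decimal digits in the maximum value of each Java integral type, which can be checked with a stdlib-only snippet (the `digits` helper is just for illustration):

```java
// The four default precisions equal the decimal digit counts of the
// Java integral max values:
//   Byte.MAX_VALUE    = 127                 -> 3 digits
//   Short.MAX_VALUE   = 32767               -> 5 digits
//   Integer.MAX_VALUE = 2147483647          -> 10 digits
//   Long.MAX_VALUE    = 9223372036854775807 -> 19 digits
class PrecisionCheck {
    static int digits(long maxValue) {
        return String.valueOf(maxValue).length();
    }
}
```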
could you link the reference for these values?
I updated the javadoc to specify where the values are coming from. This entire class was copied from Hive source code but has since deviated. We have a backlog item mentioned above to eventually rename this to CoralTypeSystem.
@@ -622,6 +622,15 @@ public void testNameSakeColumnNamesShouldGetUniqueIdentifiers() {
  assertEquals(generated, expected);
}

@Test
public void testUnionIntAndBigInt() {
  final String sql = "SELECT a FROM test.tableOne UNION ALL SELECT z FROM test.tableTwo";
(1) can we add a couple more assertions in the unit test - (int union all bigint), (bigint union all int), (tinyint union all bigint) etc
(2) could you add another UT (here & in HiveToTrino tests) for CASE WHEN table.a = CAST(1 AS BIGINT) type queries as mentioned in the PR description
(1) The reason I originally didn't add more tests is that Calcite already has tests using the same precision values. But I definitely agree more tests in Coral don't hurt. I added tests comparing INT to the other three integer types, since those are the most common comparisons.
(2) Added.
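The intent of those union tests can be sketched with a stdlib-only model (the helper names are hypothetical, not Coral's API): the branch whose column has the lower-precision type should end up wrapped in a CAST to the wider type, while the wider branch is left alone.

```java
import java.util.Map;

// Hypothetical sketch of the expected UNION typing rule: the result
// type is the wider of the two branch types, and only the narrower
// branch is cast.
class UnionCastSketch {
    static final Map<String, Integer> PRECISION =
            Map.of("TINYINT", 3, "SMALLINT", 5, "INTEGER", 10, "BIGINT", 19);

    static String widerType(String a, String b) {
        return PRECISION.get(a) >= PRECISION.get(b) ? a : b;
    }

    static String branch(String col, String colType, String targetType) {
        return PRECISION.get(colType) < PRECISION.get(targetType)
                ? "CAST(" + col + " AS " + targetType + ")"
                : col;
    }
}
```

For an INTEGER column unioned with a BIGINT column, `widerType` is BIGINT and only the INTEGER branch gets the cast, regardless of branch order.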
return SqlStdOperatorTable.CAST.createCall(ZERO, operand, getSqlDataTypeSpecForCasting(relDataType));
}

private SqlDataTypeSpec getSqlDataTypeSpecForCasting(RelDataType relDataType) {
could you add an example of Coral translation for before & after this code change? Since there's no difference in unit tests, does that mean this is a bug or an unnecessary transformation?
"could you add an example of Coral translation for before & after this code change"

After adding in the precisions, the first case in testSubstrWithTimestampOperator in HiveToTrinoConverterTest started to fail, because we were creating a faulty type spec that fails this assertion in Calcite's derive-type function for SqlBasicTypeNameSpec. Therefore getSqlDataTypeSpecForCasting is a bug; if you look at SqlTypeUtil.convertTypeToSpec, there are more involved steps that use the precision/scale values.
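The shape of that bug can be modeled without Calcite (this toy code only mirrors the idea, not `SqlTypeUtil.convertTypeToSpec` itself): a spec that drops precision and scale no longer describes the same type, while a proper conversion carries them through.

```java
// Toy illustration of the type-spec bug: dropping precision/scale when
// converting a type to a spec loses information that validation needs.
class TypeSpecSketch {
    // What an ad-hoc conversion effectively does for parameterized types:
    static String faultySpec(String name, int precision, int scale) {
        return name; // "DECIMAL" -- precision/scale silently dropped
    }

    // What a proper conversion must preserve:
    static String properSpec(String name, int precision, int scale) {
        return name + "(" + precision + ", " + scale + ")"; // "DECIMAL(10, 2)"
    }
}
```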
I did run the tests against the li-trino-hotfix branch, but as expected there were regressions outside the scope of this PR since I branched it from master.
coral-trino/src/test/java/com/linkedin/coral/trino/rel2trino/HiveToTrinoConverterTest.java (outdated review thread, resolved)
@Test
public void testIntCastToBigIntDuringComparison() {
  // We're testing that a comparison between INT and BIGINT sees a cast on the more restrictive type to the less restrictive type.
Why is this required? What happens if you query Trino directly with a comparison operation without the CAST?
It's not explicitly required (tested in Trino to confirm); however, Calcite (during validation) will implicitly type-coerce comparisons between two types by casting the more restrictive type to the less restrictive type. This PR just makes sure we give Calcite the right precision values to correctly figure out which type is more/less restrictive.
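That coercion rule can be sketched with a stdlib-only model (names are illustrative, not the actual Calcite implementation): whichever side of the comparison has the lower precision gets wrapped in a CAST to the other side's type.

```java
import java.util.Map;

// Illustrative model of implicit comparison coercion: the more
// restrictive (lower-precision) operand is cast to the less
// restrictive type before comparing.
class ComparisonCoercionSketch {
    static final Map<String, Integer> PRECISION =
            Map.of("TINYINT", 3, "SMALLINT", 5, "INTEGER", 10, "BIGINT", 19);

    static String coerce(String lhs, String lhsType, String rhs, String rhsType) {
        int lp = PRECISION.get(lhsType);
        int rp = PRECISION.get(rhsType);
        if (lp < rp) {
            lhs = "CAST(" + lhs + " AS " + rhsType + ")";
        } else if (rp < lp) {
            rhs = "CAST(" + rhs + " AS " + lhsType + ")";
        }
        return lhs + " = " + rhs;
    }
}
```

With the wrong (missing) precisions the cast could land on the bigint side; with the correct ones, `coerce("a", "INTEGER", "b", "BIGINT")` produces `CAST(a AS BIGINT) = b`.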
* add precision for int and bigint for hive type system
* use static final variables for precisions
* spotless check
* fix converting type to typespec bug
* add UT
* add tinyint and smallint precisions
* spotless
* spotless
* spotless
* new UTs
* add tableInt to rel to trino tests
* spotless
* remove redundant test
What changes are proposed in this pull request, and why are they necessary?

Previously, unions between a `bigint` column and an `int` column would result in an `int` column, which is incorrect since `bigint` is a superset of `int`, i.e. `bigint` is less restrictive than `int`, and not the other way around. This bug was caused by missing numeric precision values in `HiveTypeSystem`, which is used in Calcite to determine the least restrictive type between two columns. There are a few changes in this PR, including some subsequent fixes that are needed after adding the precisions in:

* Add the missing numeric precisions in `HiveTypeSystem` for all integer types: tinyint, smallint, int, bigint
* Fix the faulty type-spec conversion in `FromUtcTimestampOperatorTransformer` by replacing that method with Calcite's already existing util function to properly convert types to specs for us

How was this patch tested?
Snippet of Input Hive query (where column `a` is of type `int`):

Previous Trino Translation:
Since Calcite couldn't correctly figure out the least restrictive type because of the missing precisions, it would cast the bigint type to int instead of the other way around.

New Trino Translation (regression fixed):
Casting now happens on the more restrictive type (`int`), which is cast to the less restrictive `bigint`.