Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handling of column types for Presto, Trino, et al. #28653

Merged

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented May 23, 2024

SUMMARY

This is a rehash of @brouberol's #26782 which was later reverted in #28613 as it was problematic with Trino, given both the PrestoEngineSpec and TrinoEngineSpec derive from the PrestoBaseEngineSpec.

The TL;DR is typically most SQLAlchemy dialects (including Trino) use native SQLAlchemy types for encoding column types, whereas PyHive is unconventional and uses str. I asked a similar question years ago and you can see here how these types are materialized.

I hypothesize that part of the confusion when @brouberol's was authoring their PR was that the ResultSetColumnType TypedDict was incorrect, alluding to the fact that the type field was Optional[str] as opposed to Optional[Union[str, TypeEngine, type[TypeEngine]]. This PR simply ensures that in the PrestoBaseEngineSpec class—which handles Hive, Presto, and Trino—that we only try to coerce str types to a SQLAlchemy type (if available).

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added unit tests.

ADDITIONAL INFORMATION

@john-bodley john-bodley changed the title John bodley fix presto column types fix: Handling of column types for Presto, Trino, et al. May 23, 2024
Copy link

codecov bot commented May 23, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 83.50%. Comparing base (76d897e) to head (6a35459).
Report is 158 commits behind head on master.

Files Patch % Lines
superset/db_engine_specs/presto.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #28653       +/-   ##
===========================================
+ Coverage   60.48%   83.50%   +23.01%     
===========================================
  Files        1931      522     -1409     
  Lines       76236    37497    -38739     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    31313    -14801     
+ Misses      28017     6184    -21833     
+ Partials     2105        0     -2105     
Flag Coverage Δ
hive 49.12% <37.50%> (-0.05%) ⬇️
javascript ?
mysql 77.16% <62.50%> (?)
postgres 77.30% <62.50%> (?)
presto 53.67% <37.50%> (-0.13%) ⬇️
python 83.50% <87.50%> (+20.02%) ⬆️
sqlite 76.75% <62.50%> (?)
unit 58.97% <75.00%> (+1.35%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@john-bodley john-bodley marked this pull request as ready for review May 23, 2024 21:57
@dosubot dosubot bot added data:connect:presto Related to Presto data:connect:trino Related to Trino labels May 23, 2024
@john-bodley john-bodley requested a review from villebro May 24, 2024 14:37
assert result is not None
actual = result.compile(
dialect=PrestoDialect(), compile_kwargs={"literal_binds": True}
assert (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same logic as before just condensed.

@admsev
Copy link

admsev commented May 25, 2024

@gooroodev review

@gooroodev
Copy link

1. Summary of Changes

The pull request addresses the handling of column types for Presto and Trino engines in the Superset project. The key changes include:

  • Presto Engine:
    • Handling of column types in the get_children and where_latest_partition methods.
    • Added logic to convert string column types to SQLAlchemy types dynamically.
  • Trino Engine:
    • Added a similar handling of column types in the where_latest_partition method.
    • Added unit tests for the where_latest_partition method.
  • Typing Changes:
    • Introduced a new type alias SQLType in superset_typing.py to represent SQLAlchemy column types.
    • Updated ResultSetColumnType to use SQLType.

2. Issues, Bugs, or Typos

Issue 1:

  • File: superset/db_engine_specs/presto.py
  • Code:
    match = pattern.match(cast(str, column["type"]))
    Problem: The use of cast(str, column["type"]) is unnecessary if column["type"] is already a string.
    Improvement:
    match = pattern.match(column["type"])

Issue 2:

  • File: superset/db_engine_specs/presto.py
  • Code:
    if isinstance(col_type, str):
        if col_type_class := getattr(types, col_type, None):
            col_type = col_type_class()
        else:
            col_type = None
    Problem: The assignment of None to col_type if col_type_class is not found can be simplified.
    Improvement:
    if isinstance(col_type, str):
        col_type_class = getattr(types, col_type, None)
        col_type = col_type_class() if col_type_class else None

3. General Review of Code Quality and Style

  • Code Quality: The code changes are well-structured and address the problem of handling different column types effectively. The use of type hinting and type checking improves code readability and maintainability.
  • Code Style: The code adheres to PEP 8 standards, making it clean and easy to read. The use of modern Python features like type hinting and := operator (walrus operator) is commendable.
  • Testing: The addition of unit tests for the where_latest_partition method in both Presto and Trino engines ensures that the changes are tested and validated.

Additional Suggestions

  • Documentation: Consider adding docstrings to the modified methods to explain the purpose and functionality of the changes.
  • Error Handling: The use of broad exception handling (raise Exception) could be improved by raising more specific exceptions to provide better context in case of errors.

Overall, the pull request is well-implemented and addresses the problem effectively. The suggested improvements are minor and aim to further enhance the code quality.

Yours, Gooroo.dev. To receive reviews automatically, install Github App

@john-bodley john-bodley force-pushed the john-bodley--fix-presto-column-types branch from 6a35459 to f4ec019 Compare May 28, 2024 15:19
@john-bodley john-bodley merged commit 4ff1740 into apache:master May 28, 2024
30 checks passed
@john-bodley john-bodley deleted the john-bodley--fix-presto-column-types branch May 28, 2024 15:56
@michael-s-molina michael-s-molina added the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label May 28, 2024
EnxDev pushed a commit to EnxDev/superset that referenced this pull request May 31, 2024
eschutho pushed a commit that referenced this pull request Jun 23, 2024
@mistercrunch mistercrunch added 🍒 4.0.2 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels labels Jul 24, 2024
irublev pushed a commit to HighviewPower/superset that referenced this pull request Oct 29, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels data:connect:presto Related to Presto data:connect:trino Related to Trino size/L v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch 🍒 4.0.2 🚢 4.1.0
Projects
None yet
5 participants