Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(viz): improve dtype inference logic #12933

Merged
merged 5 commits into from
Feb 4, 2021

Conversation

villebro
Copy link
Member

@villebro villebro commented Feb 3, 2021

SUMMARY

Currently query result formats are checked based on the dtype property of the DataFrame coming back from the database. This works for some types, but not

  • columns that contain nulls (a known limitation in Pandas which causes an otherwise regular int column to become object)
  • date (=always object)
  • UTC based timestamps (all timestamps were datetime64[ns, UTC] on e.g. BigQuery when the dtype was cast to string, causing it to be returned as a non-temporal type)

This PR replaces the current inference logic with the infer_dtype utility offered by Pandas and adds tests for typical column types. This is really just a quick fix to make table chart column types work properly, as we're currently inferring column datatypes in three (!) different places in the codebase. A bigger refactor uniting these into one place needs to be done ASAP, but is left to a later SIP/PR. This also bumps superset-ui to the latest version which includes a fix for the default temporal format: apache-superset/superset-ui#937

AFTER

This screenshot shows TIMESTAMP, DATETIME and DATE columns on a BigQuery dataset on table chart with this fix
image

BEFORE

The same chart prior to the PR
image

TEST PLAN

Local testing on various databases + new tests

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

Copy link
Member Author

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some annotations

Comment on lines 1073 to 1103
def test_extract_dataframe_dtypes(self):
df = pd.DataFrame(
data={
"dt": [date(2021, 2, 4), date(2021, 2, 4)],
"dttm": [datetime(2021, 2, 4, 1, 1, 1), datetime(2021, 2, 4, 1, 1, 1)],
"str": ["foo", "foo"],
"int": [1, 1],
"float": [0.5, 0.5],
"obj": [{"a": 1}, {"a": 1}],
"dt_null": [None, date(2021, 2, 4)],
"dttm_null": [None, datetime(2021, 2, 4, 1, 1, 1)],
"str_null": [None, "foo"],
"int_null": [None, 1],
"float_null": [None, 0.5],
"obj_null": [None, {"a": 1}],
}
)
assert extract_dataframe_dtypes(df) == [
GenericDataType.TEMPORAL,
GenericDataType.TEMPORAL,
GenericDataType.STRING,
GenericDataType.NUMERIC,
GenericDataType.NUMERIC,
GenericDataType.STRING,
GenericDataType.TEMPORAL,
GenericDataType.TEMPORAL,
GenericDataType.STRING,
GenericDataType.NUMERIC,
GenericDataType.NUMERIC,
GenericDataType.STRING,
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're making sure that all main datatypes are correctly identified both with and without null values.

superset/utils/core.py Outdated Show resolved Hide resolved
@villebro villebro requested a review from ktmud February 3, 2021 23:12
@junlincc junlincc added rush! Requires immediate attention #bug:blocking! Blocking issues with high priority hold:testing! On hold for testing labels Feb 3, 2021
@codecov-io
Copy link

codecov-io commented Feb 3, 2021

Codecov Report

Merging #12933 (b5f54f9) into master (9982fde) will decrease coverage by 2.24%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12933      +/-   ##
==========================================
- Coverage   69.14%   66.90%   -2.25%     
==========================================
  Files        1025      489     -536     
  Lines       48767    28674   -20093     
  Branches     5188        0    -5188     
==========================================
- Hits        33718    19183   -14535     
+ Misses      14915     9491    -5424     
+ Partials      134        0     -134     
Flag Coverage Δ
cypress ?
javascript ?
python 66.90% <100.00%> (-0.72%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/common/query_context.py 82.14% <100.00%> (-0.45%) ⬇️
superset/utils/core.py 88.15% <100.00%> (+0.11%) ⬆️
superset/db_engines/hive.py 0.00% <0.00%> (-85.72%) ⬇️
superset/sql_validators/postgres.py 50.00% <0.00%> (-50.00%) ⬇️
superset/db_engine_specs/hive.py 73.84% <0.00%> (-16.93%) ⬇️
superset/databases/commands/create.py 83.67% <0.00%> (-8.17%) ⬇️
superset/databases/commands/update.py 85.71% <0.00%> (-8.17%) ⬇️
superset/db_engine_specs/sqlite.py 90.62% <0.00%> (-6.25%) ⬇️
superset/connectors/sqla/models.py 84.51% <0.00%> (-6.04%) ⬇️
superset/databases/commands/test_connection.py 84.78% <0.00%> (-4.35%) ⬇️
... and 548 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9982fde...b5f54f9. Read the comment docs.

@ktmud
Copy link
Member

ktmud commented Feb 3, 2021

I think we can first do a mapping based on types returned by pandas.api.types.infer_dtype (which is much more efficient since it uses cpython bindings), then manually infer for mixed when necessary.

Copy link
Member

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PANDAS_DTYPE_MAP = {
    "date": GenericDataType.TEMPORAL,
    "time": GenericDataType.TEMPORAL,
    "datetime": GenericDataType.TEMPORAL,
    "datetime64": GenericDataType.TEMPORAL,
    "integer": GenericDataType.NUMERIC,
    "floating": GenericDataType.NUMERIC,
    "decimal": GenericDataType.NUMERIC,
    "mixed-integer-float": GenericDataType.NUMERIC,
    "boolean": GenericDataType.BOOLEAN,
}


def extract_dataframe_dtypes(df: pd.DataFrame) -> List[GenericDataType]:
    """Serialize pandas/numpy dtypes to generic types"""
    return [
        PANDAS_DTYPE_MAP.get(infer_dtype(df[col], skipna=True), GenericDataType.STRING)
        for col in df.columns
    ]

Maybe a manual inference of more complex type is not even needed since infer_dtype can already take care of NA's for us.

tests/utils_tests.py Outdated Show resolved Hide resolved
@junlincc junlincc self-requested a review February 4, 2021 02:35
Copy link
Member

@junlincc junlincc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2021-02-03 at 6 34 48 PM

is this ^ normal? in postgresql @ktmud @villebro

@junlincc
Copy link
Member

junlincc commented Feb 4, 2021

Screen Shot 2021-02-03 at 6 40 45 PM

this one is in mysql ^
the scope of this fix seems large. @ktmud please help

@villebro
Copy link
Member Author

villebro commented Feb 4, 2021

I think we can first do a mapping based on types returned by pandas.api.types.infer_dtype (which is much more efficient since it uses cpython bindings), then manually infer for mixed when necessary.

Good idea, I'll update to use this logic

@villebro
Copy link
Member Author

villebro commented Feb 4, 2021

Screen Shot 2021-02-03 at 6 34 48 PM

is this ^ normal? in postgresql

Let me check this, too

@pull-request-size pull-request-size bot added size/M and removed size/L labels Feb 4, 2021
superset/utils/core.py Outdated Show resolved Hide resolved
superset/utils/core.py Outdated Show resolved Hide resolved
@villebro villebro force-pushed the villebro/fix-table-date-format branch from 4959d52 to b5f54f9 Compare February 4, 2021 13:41
@villebro
Copy link
Member Author

villebro commented Feb 4, 2021

Screen Shot 2021-02-03 at 6 34 48 PM is this ^ normal? in postgresql

Let me check this, too

@junlincc the wacky formats in the table chart have now been fixed as of superset-ui version 0.17.6 (bumped in this PR). Regarding the results and samples tabs below, previously it wasn't possible to apply formatting to other than the __timestamp column, so this isn't a regression. As of #10270 we're now getting column types in the result payload, hence we can apply formatting. I have a PR mostly ready that does this, but I'd rather leave it for a separate PR to minimize regression risk and isolate this PR as a fix (the one that adds proper formatting to results and samples will be a new feature).

Copy link
Member

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Agreed the fix for data panel should be in a separate PR.

@ktmud ktmud merged commit ac73991 into apache:master Feb 4, 2021
@villebro villebro deleted the villebro/fix-table-date-format branch February 4, 2021 19:07
villebro added a commit to preset-io/superset that referenced this pull request Feb 4, 2021
@junlincc junlincc removed the hold:testing! On hold for testing label Feb 4, 2021
@junlincc junlincc removed #bug:blocking! Blocking issues with high priority rush! Requires immediate attention labels Mar 15, 2021
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels preset-io size/M 🚢 1.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants