-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(viz): improve dtype inference logic #12933
Conversation
20a4423
to
944339a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some annotations
tests/utils_tests.py
Outdated
def test_extract_dataframe_dtypes(self): | ||
df = pd.DataFrame( | ||
data={ | ||
"dt": [date(2021, 2, 4), date(2021, 2, 4)], | ||
"dttm": [datetime(2021, 2, 4, 1, 1, 1), datetime(2021, 2, 4, 1, 1, 1)], | ||
"str": ["foo", "foo"], | ||
"int": [1, 1], | ||
"float": [0.5, 0.5], | ||
"obj": [{"a": 1}, {"a": 1}], | ||
"dt_null": [None, date(2021, 2, 4)], | ||
"dttm_null": [None, datetime(2021, 2, 4, 1, 1, 1)], | ||
"str_null": [None, "foo"], | ||
"int_null": [None, 1], | ||
"float_null": [None, 0.5], | ||
"obj_null": [None, {"a": 1}], | ||
} | ||
) | ||
assert extract_dataframe_dtypes(df) == [ | ||
GenericDataType.TEMPORAL, | ||
GenericDataType.TEMPORAL, | ||
GenericDataType.STRING, | ||
GenericDataType.NUMERIC, | ||
GenericDataType.NUMERIC, | ||
GenericDataType.STRING, | ||
GenericDataType.TEMPORAL, | ||
GenericDataType.TEMPORAL, | ||
GenericDataType.STRING, | ||
GenericDataType.NUMERIC, | ||
GenericDataType.NUMERIC, | ||
GenericDataType.STRING, | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're making sure that all main datatypes are correctly identified both with and without null values.
Codecov Report
@@ Coverage Diff @@
## master #12933 +/- ##
==========================================
- Coverage 69.14% 66.90% -2.25%
==========================================
Files 1025 489 -536
Lines 48767 28674 -20093
Branches 5188 0 -5188
==========================================
- Hits 33718 19183 -14535
+ Misses 14915 9491 -5424
+ Partials 134 0 -134
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
I think we can first do a mapping based on types returned by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PANDAS_DTYPE_MAP = {
"date": GenericDataType.TEMPORAL,
"time": GenericDataType.TEMPORAL,
"datetime": GenericDataType.TEMPORAL,
"datetime64": GenericDataType.TEMPORAL,
"integer": GenericDataType.NUMERIC,
"floating": GenericDataType.NUMERIC,
"decimal": GenericDataType.NUMERIC,
"mixed-integer-float": GenericDataType.NUMERIC,
"boolean": GenericDataType.BOOLEAN,
}
def extract_dataframe_dtypes(df: pd.DataFrame) -> List[GenericDataType]:
"""Serialize pandas/numpy dtypes to generic types"""
return [
PANDAS_DTYPE_MAP.get(infer_dtype(df[col], skipna=True), GenericDataType.STRING)
for col in df.columns
]
Maybe a manual inference of more complex type is not even needed since infer_dtype
can already take care of NA
's for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one is in mysql ^ |
Good idea, I'll update to use this logic |
4959d52
to
b5f54f9
Compare
is this ^ normal? in postgresql @junlincc the wacky formats in the table chart have now been fixed as of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Agreed the fix for data panel should be in a separate PR.
SUMMARY
Currently query result formats are checked based on the
dtype
property of theDataFrame
coming back from the database. This works for some types, but notint
column to becomeobject
)object
)datetime64[ns, UTC]
on e.g. BigQuery when thedtype
was cast to string, causing it to be returned as a non-temporal type)This PR replaces the current inference logic with the
infer_dtype
utility offered by Pandas and adds tests for typical column types. This is really just a quick fix to make table chart column types work properly, as we're currently inferring column datatypes in three (!) different places in the codebase. A bigger refactor uniting these into one place needs to be done ASAP, but is left to a later SIP/PR. This also bumpssuperset-ui
to the latest version which includes a fix for the default temporal format: apache-superset/superset-ui#937AFTER
This screenshot shows
TIMESTAMP
,DATETIME
andDATE
columns on a BigQuery dataset on table chart with this fixBEFORE
The same chart prior to the PR
TEST PLAN
Local testing on various databases + new tests
ADDITIONAL INFORMATION