Improve database type inference #4724
Conversation
Force-pushed from 247dff7 to 0e6b775.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4724     +/-   ##
==========================================
+ Coverage   61.31%   61.32%   +<.01%
==========================================
  Files         369      369
  Lines       23468    23483      +15
  Branches     2717     2717
==========================================
+ Hits        14390    14401      +11
- Misses       9066     9070       +4
  Partials       12       12
```
superset/dataframe.py (outdated)

```python
column_names = [col[0] for col in cursor_description]
self.column_names = dedup(column_names)

# check whether the result set has any nested dict columns
```
This comment confused me... we're not looking for nested dicts here, just dicts, right?
I just moved that code from `sql_lab.convert_results_to_df`, but I think you're right about the comment. I didn't write that line originally.
superset/dataframe.py (outdated)

```python
# check whether the result set has any nested dict columns
if data:
    has_dict_col = any([isinstance(c, dict) for c in data[0]])
    df_data = list(data) if has_dict_col else np.array(data, dtype=object)
```
Are you doing this because, if a column has dicts, all columns will have the `object` type in the dataframe when initialized from the numpy array? If so, why not use `infer_objects` here?

```python
df_data = np.array(data or [], dtype=object)
df = pd.DataFrame(df_data, columns=column_names).infer_objects()
```
Just relocating the code here. I agree that it's opaque.
I tested the behavior with `[(1, 2, {'a': 'b'})]` as data, trying to understand the code. If we build the dataframe using `np.array` on data with a dict, we get the following types for the columns:

```python
>>> data = [(1, 2, {'a': 'b'})]
>>> df = pd.DataFrame(np.array(data, dtype=object))
>>> print(df.dtypes)
0    object
1    object
2    object
dtype: object
```

Here Pandas doesn't know that the first two columns are integers. OTOH, using `list(data)`:

```python
>>> data = [(1, 2, {'a': 'b'})]
>>> df = pd.DataFrame(list(data))
>>> print(df.dtypes)
0     int64
1     int64
2    object
dtype: object
```

Now we have correct type information for all columns. So it seems to me that `list(data) if has_dict_col else np.array(data, dtype=object)` is trying to preserve the type information.

We can get the same result using `infer_objects()`:

```python
>>> data = [(1, 2, {'a': 'b'})]
>>> df = pd.DataFrame(np.array(data, dtype=object)).infer_objects()
>>> print(df.dtypes)
0     int64
1     int64
2    object
dtype: object
```
Looks like the `dtype=object` was recently added in #4629. @michellethomas It could be one of those cases where fixing one bug made another pop up.
I added `dtype=object` just to solve an issue where `np.array` was erroring on data with different types in a tuple (`[('str', 1, [1], 1.0)]`). The `list(data) if...` logic was there already. `infer_objects` sounds like a good option though.
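For context, the failure mode can be reproduced with a minimal sketch. Note the exact behavior varies by NumPy version: recent releases raise `ValueError` for this "inhomogeneous" data, while older ones only warned.

```python
import numpy as np

row = [('str', 1, [1], 1.0)]  # mixed types, including a nested list, in one tuple

# Without dtype=object, NumPy tries to discover a common shape/type for all
# cells; the nested list makes the data ragged, which newer NumPy rejects.
try:
    plain = np.array(row)
    print('plain array shape:', plain.shape)
except ValueError:
    print('np.array without dtype=object failed')

# With dtype=object, every cell is stored as an opaque Python object, so
# construction always succeeds and the 2-D shape is preserved.
boxed = np.array(row, dtype=object)
print('boxed array shape:', boxed.shape)
```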
Wow, looks like `infer_objects` does much of what Bogdan was trying to do later on in `dataframe.py`; maybe we can get rid of some of that code. Though I hate to touch this, as it's impossible to fully understand the implications, I'll read the source for it and decide if I can remove more code.
superset/dataframe.py (outdated)

```diff
@@ -95,7 +141,7 @@ def agg_func(cls, dtype, column_name):
     # consider checking for key substring too.
     if cls.is_id(column_name):
         return 'count_distinct'
-    if (issubclass(dtype.type, np.generic) and
+    if (hasattr(dtype, 'tupe') and issubclass(dtype.type, np.generic) and
```
Is this a typo? `tuple`?
Typo, the intent was to check that `dtype.type` actually exists. I'm not 100% sure whether I actually hit a missing attr or if it was preventative.
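For reference, a minimal sketch of what that guard was presumably meant to do, with `'type'` spelled correctly (`agg_guard` is an illustrative name, not the actual Superset helper):

```python
import numpy as np

def agg_guard(dtype):
    # Defensive check: only consult dtype.type when the attribute exists,
    # then test whether it is a NumPy scalar type.
    return hasattr(dtype, 'type') and issubclass(dtype.type, np.generic)

print(agg_guard(np.dtype('int64')))  # a real dtype passes
print(agg_guard('not-a-dtype'))      # hasattr is False, so it short-circuits
```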
superset/db_engine_specs.py (outdated)

```python
    for k in dir(ft)
    if not k.startswith('_')
}
print(cls.type_code_map)
```
Remove line.
oops
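For context, the comprehension in that diff builds a reverse map from a DBAPI driver's numeric type codes to their names. A sketch with a stand-in class (`FieldTypes` is illustrative, standing in for something like MySQLdb's `FIELD_TYPE` constants; the numeric values mirror MySQL wire-protocol codes but are only examples):

```python
class FieldTypes:
    # Stand-in for a DBAPI field-type constants module.
    LONG = 3
    DATETIME = 12
    VARCHAR = 253

ft = FieldTypes
# Reverse map: numeric type code -> type name, skipping dunder attributes.
type_code_map = {
    getattr(ft, k): k
    for k in dir(ft)
    if not k.startswith('_')
}
print(type_code_map[253])  # VARCHAR
```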
superset/db_engine_specs.py (outdated)

```diff
@@ -550,6 +575,11 @@ def adjust_database_uri(cls, uri, selected_schema=None):
         uri.database = database
     return uri

+    @classmethod
+    def get_datatype(cls, type_code):
```
This method is identical to the base class and can be removed, no?
Yes, that's the case. I know that Presto should have this method, but I'm unclear on what the base class should do to capture the most common use cases.
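One plausible base-class behavior, sketched below as an assumption rather than Superset's actual implementation: since some drivers (pyhive among them, per this thread) already return a readable string as the type code, the base class can simply normalize it, and engines with numeric codes can override:

```python
class BaseEngineSpec:
    @classmethod
    def get_datatype(cls, type_code):
        # Drivers like pyhive return a human-readable string type code;
        # normalize it. Numeric codes are engine-specific and would be
        # handled by overriding subclasses.
        if isinstance(type_code, str) and type_code:
            return type_code.upper()
        return None  # unknown code: let callers fall back to inference

print(BaseEngineSpec.get_datatype('timestamp'))  # TIMESTAMP
print(BaseEngineSpec.get_datatype(12))           # None
```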
@betodealmeida I rebased and I'm pretty sure I addressed all comments, what do you think?
@vylc I think this should do the trick for the infamous
Force-pushed from a1336f2 to 6e8050b.
Python's DBAPI isn't super clear and homogeneous on the cursor.description specification, and this PR attempts to improve inferring the datatypes returned in the cursor. This work started around Presto's TIMESTAMP type being mishandled as a string because the database driver (pyhive) returns it as a string. The work here fixes this bug and does a better job at inferring MySQL and Presto types. It also creates a new method in db_engine_specs allowing other database engines to implement it and become more precise on type inference as needed.
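To illustrate the ambiguity: PEP 249 only says each `cursor.description` entry is a 7-item sequence `(name, type_code, ...)`, and the `type_code` format is driver-specific. The sample rows below are illustrative, not captured from real drivers:

```python
# pyhive (Presto) reports string type codes, while drivers such as
# mysqlclient report numeric ones -- both are valid per PEP 249.
presto_desc = [('ds', 'timestamp', None, None, None, None, True)]
mysql_desc = [('ds', 12, None, None, None, None, True)]

for desc in (presto_desc, mysql_desc):
    name, type_code = desc[0][:2]
    print(name, type(type_code).__name__)
```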
* Improve database type inference

  Python's DBAPI isn't super clear and homogeneous on the cursor.description specification, and this PR attempts to improve inferring the datatypes returned in the cursor. This work started around Presto's TIMESTAMP type being mishandled as string as the database driver (pyhive) returns it as a string. The work here fixes this bug and does a better job at inferring MySQL and Presto types. It also creates a new method in db_engine_specs allowing for other database engines to implement and become more precise on type-inference as needed.

* Fixing tests
* Addressing comments
* Using infer_objects
* Removing faulty line
* Addressing PrestoSpec redundant method comment
* Fix rebase issue
* Fix tests

(cherry picked from commit 777d876)
Note that the role of `SupersetDataFrame` isn't super clear here. I think the original intent was to derive from `pandas.DataFrame`, but this isn't allowed by the pandas lib, so it became more of a wrapper around `pandas.DataFrame`. Now its role becomes even less clear after this PR.

The longer-term vision is probably to make this class a pandas DataFrame-related utility wrapper, meaning it would know how to get a dataframe given a query & cursor, and would be able to do things like type inference or other dataframe-related operations.
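That vision could look roughly like the sketch below. This is a hypothetical wrapper, not Superset's actual class, and it only shows the type-inference piece discussed in this thread:

```python
import numpy as np
import pandas as pd

class DataFrameWrapper:
    """Hypothetical utility wrapper: build a typed DataFrame from raw
    cursor-style rows plus column names."""

    def __init__(self, data, column_names):
        # dtype=object makes construction robust to mixed-type rows;
        # infer_objects() then recovers proper column dtypes, as shown
        # earlier in this conversation.
        arr = np.array(data, dtype=object)
        self.df = pd.DataFrame(arr, columns=column_names).infer_objects()

w = DataFrameWrapper([(1, 2, {'a': 'b'})], ['x', 'y', 'z'])
print(w.df.dtypes['x'], w.df.dtypes['z'])  # int64 object
```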