-
Notifications
You must be signed in to change notification settings - Fork 14.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bugfix] convert metrics to numeric in dataframe #4726
Conversation
It appears sometimes the dbapi driver and pandas's read_sql fail at returning the proper numeric types for metrics and they show up as `object` in the dataframe. This results in "No numeric types to aggregate" errors when trying to perform aggregations or pivoting in pandas. This PR looks for metrics in dataframes that are typed as "object" and uses pandas' to_numeric to convert.
Codecov Report
@@ Coverage Diff @@
## master #4726 +/- ##
==========================================
- Coverage 72.22% 72.22% -0.01%
==========================================
Files 204 204
Lines 15323 15329 +6
Branches 1180 1181 +1
==========================================
+ Hits 11067 11071 +4
- Misses 4253 4255 +2
Partials 3 3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to go, just a few comments.
@@ -170,11 +170,21 @@ def get_df(self, query_obj=None): | |||
if self.datasource.offset: | |||
df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset) | |||
df[DTTM_ALIAS] += self.time_shift | |||
|
|||
self.df_metrics_to_num(df, query_obj.get('metrics') or []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: {}.get
already takes a default value when the key is not present:
self.df_metrics_to_num(df, query_obj.get('metrics', []))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way is a bit safer if the key exists with a None
value it will make it an empty dict. That situation shouldn't occur, but it's a tiny bit safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point. I assumed keys were never None
.
superset/viz.py
Outdated
@staticmethod | ||
def df_metrics_to_num(df, metrics): | ||
"""Converting metrics to numeric when pandas.read_sql cannot""" | ||
for col, dtype in df.dtypes.iteritems(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python 3 dicts have no iteritems()
, better to just use items()
(the cost in Python is building the object, but the dict is small so it shouldn't matter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I got confused as I copied the code over and though it must have been backported for compatibility reasons. I'll grep and remove other instances of it where I copied it from.
* [bugfix] convert metrics to numeric in dataframe It appears sometimes the dbapi driver and pandas's read_sql fail at returning the proper numeric types for metrics and they show up as `object` in the dataframe. This results in "No numeric types to aggregate" errors when trying to perform aggregations or pivoting in pandas. This PR looks for metrics in dataframes that are typed as "object" and uses pandas' to_numeric to convert. * Fix tests * Remove all iteritems
* [bugfix] convert metrics to numeric in dataframe It appears sometimes the dbapi driver and pandas's read_sql fail at returning the proper numeric types for metrics and they show up as `object` in the dataframe. This results in "No numeric types to aggregate" errors when trying to perform aggregations or pivoting in pandas. This PR looks for metrics in dataframes that are typed as "object" and uses pandas' to_numeric to convert. * Fix tests * Remove all iteritems
* [bugfix] convert metrics to numeric in dataframe It appears sometimes the dbapi driver and pandas's read_sql fail at returning the proper numeric types for metrics and they show up as `object` in the dataframe. This results in "No numeric types to aggregate" errors when trying to perform aggregations or pivoting in pandas. This PR looks for metrics in dataframes that are typed as "object" and uses pandas' to_numeric to convert. * Fix tests * Remove all iteritems
"""Converting metrics to numeric when pandas.read_sql cannot""" | ||
for col, dtype in df.dtypes.items(): | ||
if dtype.type == np.object_ and col in metrics: | ||
df[col] = pd.to_numeric(df[col]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mistercrunch we've seen issues with this change because metrics are not always numeric. You could have a max of a string for example. I'm going to look into other options for doing this, but if you have any thoughts let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really clarify what a metric is or isn't. My understanding it that metrics are numeric, but I can see how it may not have been enforced in the past, and that in some places it would be interpreted as [only] being the result of an aggregate function that may or may not be numerical (say MAX(my_string)
in the context of a Table viz would just happen to work somehow in the past).
To be clear, the current conceptual model is dimensions are columns or non-aggregate-SQL-expressions, and metrics are always aggregate expressions and numeric. A more complete (but complex) model would be to have columns, sql-expression, and aggregate expressions, and each one may or may not be a metric and/or dimension. This model would require a fair amount of foundational redesign.
How common is this? My incline would be to handle non-numeric aggregate functions outside of the semantic layer, meaning in a SQL Lab subquery or upstream data pipeline.
Note that we could patch something to preserve backward compatibility here. Something like BaseViz.enforce_numeric_metrics = True
that would be set to False
for Table viz, and run the code referenced above conditionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about thinking of a metric as an aggregation, and leaving it up to the visualizations to determine what data type is required?
The system can handle max(string) as it's a valid aggregation and gets processed in the system correctly with the exception of this post processing. So I'm not suggesting anything that would require a complex redesign. It seems valid for visualizations to only allow certain types to work correctly, but I think we should leave that up to the visualization instead of having this kind of thing break when returning the dataframe in base viz.
Max(date) is an example use case, which seems valid. A less frequent use case is having sums in a case statment.
cc: @john-bodley
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michellethomas what do you think of #5176 ?
* [bugfix] convert metrics to numeric in dataframe It appears sometimes the dbapi driver and pandas's read_sql fail at returning the proper numeric types for metrics and they show up as `object` in the dataframe. This results in "No numeric types to aggregate" errors when trying to perform aggregations or pivoting in pandas. This PR looks for metrics in dataframes that are typed as "object" and uses pandas' to_numeric to convert. * Fix tests * Remove all iteritems
It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
object
in the dataframe. This results in "No numeric types toaggregate" errors when trying to perform aggregations or pivoting in
pandas.
This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.