[bugfix] convert metrics to numeric in dataframe #4726
@@ -170,11 +170,21 @@ def get_df(self, query_obj=None):
         if self.datasource.offset:
             df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset)
         df[DTTM_ALIAS] += self.time_shift
+
+        self.df_metrics_to_num(df, query_obj.get('metrics') or [])
+
         df.replace([np.inf, -np.inf], np.nan)
         fillna = self.get_fillna_for_columns(df.columns)
         df = df.fillna(fillna)
         return df
+
+    @staticmethod
+    def df_metrics_to_num(df, metrics):
+        """Converting metrics to numeric when pandas.read_sql cannot"""
+        for col, dtype in df.dtypes.items():
+            if dtype.type == np.object_ and col in metrics:
+                df[col] = pd.to_numeric(df[col])
michellethomas: @mistercrunch we've seen issues with this change because metrics are not always numeric. You could have a max of a string, for example. I'm going to look into other options for doing this, but if you have any thoughts let me know.

mistercrunch: We should really clarify what a metric is or isn't. My understanding is that metrics are numeric, but I can see how it may not have been enforced in the past, and that in some places it would be interpreted as [only] being the result of an aggregate function that may or may not be numerical (say, a `MAX` over a string column). To be clear, the current conceptual model is that dimensions are columns or non-aggregate SQL expressions, and metrics are always aggregate expressions and numeric. A more complete (but complex) model would be to have columns, SQL expressions, and aggregate expressions, where each one may or may not be a metric and/or dimension. That model would require a fair amount of foundational redesign. How common is this? My inclination would be to handle non-numeric aggregate functions outside of the semantic layer, meaning in a SQL Lab subquery or an upstream data pipeline. Note that we could patch something here to preserve backward compatibility (something like a lenient conversion).

michellethomas: What about thinking of a metric as an aggregation, and leaving it up to the visualizations to determine what data type is required? The system can handle max(string): it's a valid aggregation and gets processed correctly, with the exception of this post-processing. So I'm not suggesting anything that would require a complex redesign. It seems valid for visualizations to only allow certain types to work correctly, but I think we should leave that up to the visualization instead of having this kind of thing break when returning the dataframe in base viz. Max(date) is an example use case, which seems valid. A less frequent use case is having sums in a case statement. cc: @john-bodley

mistercrunch: @michellethomas what do you think of #5176?
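For context, `pd.to_numeric` raises a `ValueError` on values it cannot parse, which is exactly why a `MAX` over a string column breaks this post-processing step. A minimal sketch of what a backward-compatible patch might look like, assuming the `errors` parameter of `pd.to_numeric` is the mechanism (this is an illustration with made-up column names, not the snippet from the thread):

```python
import numpy as np
import pandas as pd

# Hypothetical query result: pandas.read_sql returned both metrics as object dtype.
df = pd.DataFrame({
    'sum__sales': ['1.5', '2.0', '3.25'],    # numeric metric read back as strings
    'max__name': ['alice', 'bob', 'carol'],  # non-numeric aggregate ("max of a string")
})

for col, dtype in df.dtypes.items():
    if dtype.type == np.object_:
        # errors='ignore' leaves the column untouched when conversion fails,
        # instead of raising ValueError like the default errors='raise'.
        df[col] = pd.to_numeric(df[col], errors='ignore')

print(df.dtypes)  # sum__sales -> float64, max__name stays object
```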

     def query_obj(self):
         """Building a query object"""
         form_data = self.form_data
@@ -1060,7 +1070,6 @@ def process_data(self, df, aggregate=False):
         df = df.fillna(0)
         if fd.get('granularity') == 'all':
             raise Exception(_('Pick a time granularity for your time series'))
-
         if not aggregate:
             df = df.pivot_table(
                 index=DTTM_ALIAS,
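For context on the surrounding code, `pivot_table` is what turns the long-format query result into one column per series, indexed by time. A self-contained sketch with hypothetical column names (`country` and `sum__sales` are illustrations; the real arguments come from the chart's form data):

```python
import pandas as pd

DTTM_ALIAS = '__timestamp'  # Superset's alias for the time column

df = pd.DataFrame({
    DTTM_ALIAS: pd.to_datetime(['2018-01-01', '2018-01-01',
                                '2018-01-02', '2018-01-02']),
    'country': ['US', 'FR', 'US', 'FR'],  # hypothetical groupby column
    'sum__sales': [10, 20, 30, 40],       # hypothetical metric
})

# One row per timestamp, one column per (metric, group) combination.
pivoted = df.pivot_table(
    index=DTTM_ALIAS,
    columns='country',
    values='sum__sales',
)
print(pivoted)
```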
@@ -1384,7 +1393,7 @@ def get_data(self, df):
         pt = (pt / pt.sum()).T
         pt = pt.reindex(row.index)
         chart_data = []
-        for name, ys in pt.iteritems():
+        for name, ys in pt.items():
             if pt[name].dtype.kind not in 'biufc' or name in self.groupby:
                 continue
             if isinstance(name, string_types):
@@ -1395,7 +1404,7 @@ def get_data(self, df):
             l = [str(s) for s in name[1:]]  # noqa: E741
             series_title = ', '.join(l)
             values = []
-            for i, v in ys.iteritems():
+            for i, v in ys.items():
                 x = i
                 if isinstance(x, (tuple, list)):
                     x = ', '.join([text_type(s) for s in x])
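Both hunks swap `iteritems()` for `items()`. The two are equivalent here: on a DataFrame they iterate `(column_label, Series)` pairs, and on a Series they iterate `(index_label, value)` pairs; `items()` is the spelling that survived later pandas deprecations. A small sketch of the two iteration patterns used above:

```python
import pandas as pd

pt = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# DataFrame.items() yields (column label, column Series) pairs.
for name, ys in pt.items():
    # Series.items() yields (index label, scalar value) pairs.
    for i, v in ys.items():
        print(name, i, v)
```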
Nit: `{}.get` already takes a default value when the key is not present.

This way is a bit safer: if the key exists with a `None` value, it will make it an empty dict. That situation shouldn't occur, but it's a tiny bit safer.

Ah, good point. I assumed keys were never `None`.
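The difference being discussed, sketched with a hypothetical key name:

```python
form_data = {'extras': None}  # hypothetical: the key exists but holds None

# A .get default only applies when the key is absent:
print(form_data.get('extras', {}))    # -> None (key present, default unused)

# `get(...) or {}` also replaces falsy values such as None:
print(form_data.get('extras') or {})  # -> {}
```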