This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Boxplot - wrong logic #12005
Comments
While we don't yet support using the full raw data and then calculating the boxplot on that, we recently migrated Boxplot to ECharts and added some features in this PR: #11199. In the example below I've created a Boxplot where the categories are continents and the distribution is calculated across countries (I'm using the average of total population, as the dataset contains data for multiple years): If you have a row id, you can use that as the "Distribute Across" parameter. The plan is to add support for using the raw row data, but I probably won't have time to work on it any time soon.
The problem with calculating the Boxplot metrics on the query result is that, even when distributing across a unique column, the row limit hits hard and silently: So we have three issues that are caused by the current Boxplot logic:
In my eyes, all of these issues can best be addressed by calculating all Boxplot metrics per series directly within the SQL query. The only drawback that I can immediately see is that outliers cannot be returned by such a query...
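To make the proposal and its drawback concrete, here is a minimal sketch in plain Python that stands in for what a per-series aggregation would compute. It is not Superset's implementation. The function `series_summary`, the sample rows, and the Tukey-style 1.5 × IQR whisker rule are all assumptions made for illustration; the actual whisker options and the percentile functions available depend on the chart settings and on the database.

```python
from collections import defaultdict
from statistics import quantiles


def series_summary(values, whisker=1.5):
    """Box plot metrics for one series (assumes Tukey-style whiskers)."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Outliers are individual rows, not an aggregate of the series.
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Hypothetical raw rows: (series, value).
rows = [("A", v) for v in (1, 2, 4, 5, 6, 7, 8, 9, 30)]
rows += [("B", v) for v in (10, 11, 12, 13, 14)]

grouped = defaultdict(list)
for series, value in rows:
    grouped[series].append(value)

summaries = {series: series_summary(values) for series, values in grouped.items()}
print(summaries["A"])
```

Every metric except `outliers` is a single value per series, so it fits a one-row-per-series query result. The outliers are a variable-length set of raw rows that can only be selected once the fences are known, which is why a single aggregate query cannot return them directly.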
Not sure to what extent it would be useful to create new issues for the three items above...
I filed a separate bug for it: #17042
@rumbin please feel free to open separate issues for all.
@junlincc
I am going to write up a separate issue for 1. soon, which will elaborate the pros and cons of calculating the box plot metrics in a push-down fashion directly in the database...
The box plot assumes that there is only one observation per timestamp. In my case this is not true, so the box plot aggregates all of the observations per timestamp using the function that you select, which skews the results. It needs to look at the raw data instead of the aggregation.
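The effect described above can be reproduced with a small, self-contained sketch. The data, the choice of the mean as the aggregate function, and the helper `five_number_summary` are made up for illustration; this is not the query Superset generates.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])


# Hypothetical data: several observations per timestamp.
observations = {
    "t1": [1, 2, 30],
    "t2": [4, 5, 6],
    "t3": [7, 8, 9],
}

# What the issue asks for: statistics over every raw observation.
raw_values = [v for values in observations.values() for v in values]
raw_summary = five_number_summary(raw_values)

# What happens when each timestamp is aggregated first:
# only one value per timestamp reaches the box plot calculation.
aggregated_values = [mean(values) for values in observations.values()]
aggregated_summary = five_number_summary(aggregated_values)

print("raw:       ", raw_summary)
print("aggregated:", aggregated_summary)
```

In this example the aggregated input hides the extreme observation 30 entirely and shifts every quartile, so the resulting box describes the distribution of per-timestamp means rather than the distribution of the observations.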
Query produced by the Superset box plot:
Results of this query:
Box plot details:
What it should be: