Boxplot - wrong logic #12005

li-ana · 2020-12-10T20:07:33Z

The box plot assumes that you have only one observation per timestamp. In my case this is not true, so the box plot aggregates all of the observations per timestamp using the function that you select, which skews the results. It needs to look at the raw data instead of the aggregation.
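A minimal sketch of the effect described here, using made-up data and only the Python standard library. It assumes AVG is the selected aggregation function and uses `statistics.quantiles` with the inclusive method as one of several possible quartile definitions; it is not Superset's actual computation.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical observations as (timestamp, value); several share a timestamp.
observations = [
    ("2020-12-01", 1), ("2020-12-01", 2), ("2020-12-01", 99),
    ("2020-12-02", 3), ("2020-12-02", 4), ("2020-12-02", 5),
    ("2020-12-03", 6), ("2020-12-03", 7), ("2020-12-03", 8),
]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Box statistics computed on the raw observations.
raw_summary = five_number_summary([v for _, v in observations])

# Box statistics computed after first averaging per timestamp.
per_timestamp = defaultdict(list)
for ts, value in observations:
    per_timestamp[ts].append(value)
aggregated_summary = five_number_summary(
    [mean(vs) for vs in per_timestamp.values()]
)

print("raw:       ", raw_summary)
print("aggregated:", aggregated_summary)
```

In this toy data the raw maximum is 99, while the aggregated maximum is only 34, because 99 is averaged with 1 and 2 before the box statistics are computed.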

Query produced by the Superset box plot:
Results of this query:

Box plot details:

What it should be:

villebro · 2020-12-11T08:19:13Z

While we don't yet support using the full raw data and then calculating the boxplot on that, we recently migrated Boxplot to ECharts and added some features in this PR: #11199. In the example below, I've created a Boxplot where the categories are continents and the distribution is calculated across countries (I'm using the average of total population, as the dataset contains data for multiple years):

If you have a row id, you can use that as the "Distribute Across" parameter. The plan is to add support for using the raw row data, but I probably won't have time to work on it any time soon.
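As a rough illustration of why a row id helps, here is a sketch of the grouping described above, with made-up rows. The function `box_inputs` and all column names are hypothetical, and this is an assumption about the behaviour, not Superset's implementation: the metric is averaged per (category, "Distribute Across") group, and each box is then drawn from those averages.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; "id" is a unique row id, population is in millions.
rows = [
    {"id": 1, "continent": "Europe", "country": "FI", "year": 2019, "population": 5},
    {"id": 2, "continent": "Europe", "country": "FI", "year": 2020, "population": 5},
    {"id": 3, "continent": "Europe", "country": "SE", "year": 2019, "population": 10},
    {"id": 4, "continent": "Europe", "country": "SE", "year": 2020, "population": 12},
    {"id": 5, "continent": "Asia", "country": "JP", "year": 2019, "population": 126},
]


def box_inputs(rows, category, distribute_across, metric):
    """Average `metric` per (category, distribute_across) group, then
    collect those averages per category as the values a box is drawn from."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[category], row[distribute_across])].append(row[metric])
    per_category = defaultdict(list)
    for (cat, _), values in groups.items():
        per_category[cat].append(mean(values))
    return dict(per_category)


# Distributing across countries: one averaged value per country.
by_country = box_inputs(rows, "continent", "country", "population")

# Distributing across a unique row id: every group holds exactly one row,
# so the average is the raw value and no observation is merged away.
by_row_id = box_inputs(rows, "continent", "id", "population")

print(by_country)
print(by_row_id)
```

With the row id as the distribution column, the Europe box is built from all four raw values instead of two per-country averages.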

rumbin · 2021-08-09T21:37:30Z

The problem with calculating the Boxplot metrics on the query result is that, even when distributing across a unique column, the row limit hits hard and silently:
If the number of data points across all series exceeds the row limit, the resulting boxplot non-deterministically excludes data points without notifying the user.
It is non-deterministic because no ORDER BY is applied, nor can one be configured.

So we have three issues that are caused by the current Boxplot logic:

1. There is no way of including all records as soon as the number of rows exceeds the row limit.
2. Whether the row limit has been reached is not displayed anywhere.
3. The row limit excludes records in a non-deterministic fashion, as no explicit ordering is present.
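The three points can be simulated in a few lines. This is a toy sketch under stated assumptions: `ROW_LIMIT`, the data, and the helper names are hypothetical, and a seeded shuffle stands in for the arbitrary row order a database may return when no ORDER BY is given.

```python
import random
from statistics import median

ROW_LIMIT = 5  # hypothetical; stands in for the chart's row limit

# Hypothetical result set: 10 data points for a single series.
rows = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]


def truncated_unordered(rows, limit, seed):
    """Simulate an unordered result set that is cut off at `limit` rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:limit]


def truncated_ordered(rows, limit):
    """With an explicit ordering, the same rows survive the limit every time."""
    return sorted(rows)[:limit]


# Different arrival orders keep different rows, so the median varies.
medians = {median(truncated_unordered(rows, ROW_LIMIT, s)) for s in range(20)}

# A simple check that a UI could surface to the user.
limit_reached = len(rows) > ROW_LIMIT

print(sorted(medians))
print(truncated_ordered(rows, ROW_LIMIT), limit_reached)
```

Note that an explicit ordering only makes the truncation repeatable. The surviving rows are still a biased subset, so the first point in the list remains unsolved by ordering alone.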

In my eyes, all of these issues can best be covered by calculating all Boxplot metrics per series directly within the SQL query. The only drawback that I can immediately see is that outliers cannot be returned by such a query...
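A sketch of what such a push-down query could look like, run against an in-memory SQLite database from Python. The table, column names, and data are hypothetical. Quartiles are computed with the nearest-rank method through window functions (available in SQLite 3.25 and later), which is an assumption made for portability; many databases offer dedicated percentile functions whose names and interpolation rules differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (series TEXT, value REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [("A", v) for v in (1, 2, 3, 4, 5, 6, 7, 8)]
    + [("B", v) for v in (10, 20, 30, 40, 500)],
)

# Nearest-rank quartiles: the value at rank ceil(p * n) within each series.
# Integer division gives ceil(n/4) = (n+3)/4, ceil(n/2) = (n+1)/2,
# and ceil(3n/4) = (3n+3)/4.
QUERY = """
WITH ranked AS (
    SELECT series,
           value,
           ROW_NUMBER() OVER (PARTITION BY series ORDER BY value) AS rn,
           COUNT(*) OVER (PARTITION BY series) AS n
    FROM observations
)
SELECT series,
       COUNT(*) AS row_count,
       MIN(value) AS min_value,
       MAX(CASE WHEN rn = (n + 3) / 4 THEN value END) AS q1,
       MAX(CASE WHEN rn = (n + 1) / 2 THEN value END) AS median,
       MAX(CASE WHEN rn = (3 * n + 3) / 4 THEN value END) AS q3,
       MAX(value) AS max_value
FROM ranked
GROUP BY series
ORDER BY series
"""

summaries = conn.execute(QUERY).fetchall()
for row in summaries:
    print(row)
```

The result contains one row per series regardless of how many data points each series has, so a row limit on the result no longer truncates the input. The value 500 in series B shows up only as `max_value`; the individual points beyond the whiskers are not listed, which is the drawback mentioned above.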

rumbin · 2021-08-09T21:39:18Z

Not sure to what extent it would be useful to create new issues for the three items above...

rumbin · 2021-10-08T20:23:19Z

I filed a separate bug for it: #17042

junlincc · 2021-10-12T21:07:17Z

@rumbin please feel free to open separate issues for all.
for 2. it's happening on all the charts, no?

rumbin · 2021-10-12T21:28:56Z

@junlincc
For 2.: Yes and no.
It is true that hitting the row limit is not shown on charts on a dashboard.
However, at least in Explore, the row count indicator pill turns red when the threshold has been reached.
This is normally the case, but not for the Box Plot.
That's what I have reported in #17042.

rumbin · 2021-10-12T21:30:58Z

I am going to write up a separate issue for 1. soon, which will elaborate on the pros and cons of calculating the box plot metrics in a push-down fashion directly in the database...

junlincc added the viz:charts:echarts label (Related to Echarts) Dec 10, 2020

junlincc added the enhancement:request label (Enhancement request submitted by anyone from the community) Dec 11, 2020

junlincc added the viz:charts:boxplot label (Related to the Boxplot chart) Oct 12, 2021

apache locked and limited conversation to collaborators Feb 2, 2022

geido converted this issue into discussion #18423 Feb 2, 2022

This issue was moved to a discussion.
