Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[box plot] rows pill shows aggregated number; row limit can be hit silently #17042

Closed
rumbin opened this issue Oct 8, 2021 · 9 comments
Closed
Labels
#bug Bug report viz:charts:boxplot Related to the Boxplot chart

Comments

@rumbin
Copy link
Contributor

rumbin commented Oct 8, 2021

Explore's pill indicator of the number of result rows is showing wrong numbers for Box Plots.
Instead of the number of rows returned from the DB query it displays the number of aggregated rows.

This way, users have no clue if the row limit kicked in and the box plot is based on incomplete data.

Furthermore, the row limit is...

  • not adjustable for this plot
  • not communicated to the user
  • applied without imposing any sort order, thus resulting in a potentially arbitrary sample of rows being evaluated by the box plot

How to reproduce the bug

  1. Use a dataset that has more rows than the configured ROW_LIMIT
  2. Create a box plot on a numeric column, distribute across a column with a high cardinality, e.g., the primary key of the dataset.
  3. Run
  4. Observe the number of rows displayed in the indicator pill

Expected results

Minimal:

  • The actual number of rows returned by the DB query is shown in the indicator.
  • If the ROW_LIMIT is hit, the indicator turns red, like it does with all the other visualizations.

Ideal, additionally:

  • The row limit is configurable, maybe even disabled, like for the histogram, iirc
  • If a row limit is applied, some sensible ORDER BY should be applied in order to at least yield deterministic, if incomplete, results.

Actual results

  1. Only the number of aggregated rows are shown, equalling the number of series displayed in the chart.
  2. This number is consistent with the Data table below the chart, which also only shows one row per series (box).

Screenshots/Screencasts

The row count pill shows only 7 result rows:

image

The Data table lists these aggregated rows of the 7 distinct series:

image

The query applies the configured ROW_LIMIT of 100000:

image

A Big Number that calculates the distinct count of entities that the box plot was distributed across proves that the ccardinality is much higher than the ROW_LIMIT and therefore the boxplot was based on incomplete data witghout the row count pill turning red:

image

Environment

  • browser type and version: Chrome 93.0.4577.63
  • superset version: 1.3.1, installed via pip
  • python version: 3.8.11
  • node.js version: v4.6.1
  • feature flags active:
    "THUMBNAILS": True,
    "ALERT_REPORTS": True,
    "ALERTS_ATTACH_REPORTS": True,
    "SQLLAB_BACKEND_PERSISTENCE": True,
    "ENABLE_TEMPLATE_PROCESSING": True,                                                                 
    "DASHBOARD_NATIVE_FILTERS": True, 
    "DASHBOARD_CROSS_FILTERS": True,
    "DASHBOARD_NATIVE_FILTERS_SET": True,
    "ENABLE_EXPLORE_DRAG_AND_DROP": True,
    "DASHBOARD_CACHE": True     

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • [ x] I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • [ x] I have reproduced the issue with at least the latest released version of superset.
  • [ x] I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

´pip freeze´

aiohttp==3.7.4.post0
alembic==1.7.3
amqp==2.6.1
apache-superset==1.3.1
apispec==3.3.2
asn1crypto==1.4.0
async-timeout==3.0.1
attrs==21.2.0
azure-common==1.1.27
azure-core==1.18.0
azure-storage-blob==12.9.0
Babel==2.9.1
backoff==1.11.1
billiard==3.6.4.0
bleach==3.3.1
boto3==1.18.51
botocore==1.21.51
Brotli==1.0.9
cachelib==0.1.1
cachetools==4.2.4
celery==4.4.7
certifi==2021.5.30
cffi==1.14.6
chardet==4.0.0
charset-normalizer==2.0.6
click==7.1.2
cmdstanpy==0.9.68
colorama==0.4.4
convertdate==2.3.2
cron-descriptor==1.2.24
croniter==1.0.15
cryptography==3.4.8
cx-Oracle==8.2.1
cycler==0.10.0
Cython==0.29.24
defusedxml==0.7.1
deprecation==2.1.0
dnspython==2.1.0
elasticsearch==7.13.4
elasticsearch-dbapi==0.2.6
email-validator==1.1.3
ephem==4.1
et-xmlfile==1.1.0
Flask==1.1.4
Flask-AppBuilder==3.3.3
Flask-Babel==1.0.0
Flask-Caching==1.10.1
Flask-Compress==1.10.1
Flask-JWT-Extended==3.25.1
Flask-Login==0.4.1
Flask-Migrate==3.1.0
Flask-OpenID==1.3.0
Flask-SQLAlchemy==2.5.1
flask-talisman==0.8.1
Flask-WTF==0.14.3
future==0.18.2
geographiclib==1.52
geopy==2.2.0
gevent==21.8.0
google-api-core==2.0.1
google-auth==2.2.1
google-cloud-bigquery==2.27.1
google-cloud-core==2.0.0
google-crc32c==1.2.0
google-resumable-media==2.0.3
googleapis-common-protos==1.53.0
graphlib-backport==1.0.3
greenlet==1.1.2
grpcio==1.41.0
gunicorn==20.0.4
hdbcli==2.10.13
holidays==0.10.3
humanize==3.11.0
idna==3.2
importlib-resources==5.2.2
isodate==0.6.0
itsdangerous==1.1.0
Jinja2==2.11.3
jmespath==0.10.0
jsonschema==3.2.0
kiwisolver==1.3.2
kombu==4.6.11
korean-lunar-calendar==0.2.1
LunarCalendar==0.0.9
Mako==1.1.5
Markdown==3.3.4
MarkupSafe==2.0.1
marshmallow==3.13.0
marshmallow-enum==1.5.1
marshmallow-sqlalchemy==0.23.1
matplotlib==3.4.3
msgpack==1.0.2
msrest==0.6.21
multidict==5.1.0
numpy==1.21.2
oauthlib==3.1.1
openpyxl==3.0.9
oscrypto==1.2.1
packaging==21.0
pandas==1.2.5
parsedatetime==2.6
pgsanity==0.2.9
Pillow==8.3.2
polyline==1.4.0
prison==0.2.1
prophet==1.0
proto-plus==1.19.2
protobuf==3.18.0
psycopg2-binary==2.8.6
pyarrow==4.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybigquery==0.10.2
pycparser==2.20
pycryptodomex==3.10.4
pyhdb==0.3.4
PyJWT==1.7.1
PyMeeus==0.5.11
pyOpenSSL==20.0.1
pyparsing==2.4.7
pyrsistent==0.18.0
pystan==2.18.0.0
python-dateutil==2.8.2
python-dotenv==0.19.0
python-geohash==0.8.5
python-ldap==3.3.1
python3-openid==3.2.0
pytz==2021.1
PyYAML==5.4.1
redis==3.5.3
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
s3transfer==0.5.0
selenium==3.141.0
setuptools-git==1.2
simplejson==3.17.5
six==1.16.0
slackclient==2.5.0
snowflake-connector-python==2.6.2
snowflake-sqlalchemy==1.2.4
SQLAlchemy==1.3.24
sqlalchemy-hana==0.5.0
SQLAlchemy-Utils==0.36.8
sqlparse==0.3.0
tabulate==0.8.9
tqdm==4.62.3
typing-extensions==3.10.0.2
ujson==4.2.0
urllib3==1.26.7
vine==1.3.0
webencodings==0.5.1
Werkzeug==1.0.1
WTForms==2.3.3
WTForms-JSON==0.3.3
xlrd==2.0.1
yarl==1.6.3
zipp==3.6.0
zope.event==4.5.0
zope.interface==5.4.0
@rumbin rumbin added the #bug Bug report label Oct 8, 2021
@rumbin rumbin changed the title [box plot] rows pill shows aggregated number; rol limit can be hit silently [box plot] rows pill shows aggregated number; row limit can be hit silently Oct 8, 2021
@junlincc junlincc added the viz:charts:boxplot Related to the Boxplot chart label Oct 12, 2021
@stale
Copy link

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 17, 2022
@rumbin
Copy link
Contributor Author

rumbin commented Apr 18, 2022

This issue should not be closed by the stale bot, in my eyes.

@stale stale bot removed the inactive Inactive for >= 30 days label Apr 18, 2022
@rumbin
Copy link
Contributor Author

rumbin commented Feb 16, 2023

bump

@rumbin
Copy link
Contributor Author

rumbin commented Jan 3, 2024

bump

@rusackas
Copy link
Member

I don't think "bump" comments actually help get this to the top of anyone's inbox. Are you interested in providing a fix for this at all? It's legit, and nearly identical to the other issue you posted here: #26402

Perhaps you'd want to carry a similar fix from the other PR (#27700) that solved your other problem over to this chart?

@rumbin
Copy link
Contributor Author

rumbin commented May 13, 2024

Sorry, I should have rephrased it to "please don't close, as the issue is still existing in the most recent release and I feel that it's important enough to remain open for better traceability/visibility".

I'm sorry that I don't have the resources to resolve the issue myself. All I can do is raise awareness of the criticality of it.

@rumbin
Copy link
Contributor Author

rumbin commented May 13, 2024

Actually, we can also close it.
My recommendation for users who want to visualize the distribution of datasets that exceed the implicit or explicit row limits is to rather use box plot charts instead and delegate the aggregation to the database.
The same recommendation applies to histogram and CDF plots.

@rumbin rumbin closed this as completed May 13, 2024
@TruongQuynhNhu
Copy link

TruongQuynhNhu commented Jun 3, 2024

hi @rumbin, I have the same issue as you, that the plot's statistic is different because the row limit is exceeded, can you please share how you resolve this? thank you.

@rlcole
Copy link

rlcole commented Oct 21, 2024

hi @rumbin, I too don't understand your recommendation. Can you please clarify?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
#bug Bug report viz:charts:boxplot Related to the Boxplot chart
Projects
None yet
Development

No branches or pull requests

5 participants