
Inconsistency between docstring and implementation in explanation_dashboard_input.py #1270

Closed
smrnvdhy opened this issue Mar 15, 2022 · 4 comments


@smrnvdhy

The docstring says the dataset must have fewer than 10000 rows and fewer than 1000 columns. However, the code checks for 100000 rows and 1000 columns.

This should be an easy fix. I'd fork and open a PR, but I don't know which value is intended. I'm slightly leaning towards 10000, because it also appears in the explanation_dashboard.py docstring.
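
For reference, a minimal sketch of the kind of check in question (hypothetical names and structure, not the actual raiwidgets source; the constants just illustrate the docstring/implementation mismatch):

```python
# Hypothetical sketch only -- not the actual explanation_dashboard_input.py code.
# The point is that the validation constant and the docstring disagree.

NUM_ROWS_LIMIT = 100000  # what the implementation actually checks
NUM_COLS_LIMIT = 1000

def _validate_dataset(dataset):
    """Dataset must have fewer than 10000 rows and fewer than 1000 columns."""
    # ^ docstring promises 10000 rows, while the check below uses 100000
    n_rows, n_cols = dataset.shape
    if n_rows > NUM_ROWS_LIMIT:
        raise ValueError(
            f"Dataset has {n_rows} rows; the limit is {NUM_ROWS_LIMIT}")
    if n_cols > NUM_COLS_LIMIT:
        raise ValueError(
            f"Dataset has {n_cols} columns; the limit is {NUM_COLS_LIMIT}")
```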

@imatiach-msft
Contributor

@smrnvdhy great find! I think we should make it 100,000 rows, since we can sometimes handle 10k-20k rows for certain datasets. However, past 15k rows the visualization usually becomes very slow to load in the browser, and past 20k rows it usually fails with an error. Our team has some ideas on how we might lift these restrictions: specifically, doing more of the calculations/clustering in the backend and using streaming so that we don't load all of the data into the UI. I'll send a PR to update the docstrings.
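
In the meantime, a possible workaround is to downsample to roughly 15k rows before computing the explanation and launching the dashboard. The sketch below is only illustrative: it assumes a pandas DataFrame and numpy labels, and the 15000 threshold is just the rough "still loads reasonably" figure mentioned above, not an official limit.

```python
# Sketch of a possible workaround, not part of this fix: randomly downsample to
# ~15k rows before computing the explanation and launching the dashboard.
import numpy as np

MAX_DASHBOARD_ROWS = 15000  # rough practical threshold, not an official limit

def downsample(X, y, max_rows=MAX_DASHBOARD_ROWS, seed=0):
    """Return a random subset of (X, y) with at most max_rows rows."""
    if len(X) <= max_rows:
        return X, y
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max_rows, replace=False)
    # Assumes X is a pandas DataFrame and y is a numpy array.
    return X.iloc[idx], y[idx]

# X_small, y_small = downsample(X_test, y_test)
# ...compute the explanation on X_small, then pass X_small/y_small to the dashboard.
```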

@smrnvdhy
Author

@imatiach-msft Thank you for addressing this so quickly! I also see the long load times you mentioned (40 minutes of wall time for 14k rows). Could you share which functionality/calculation makes the UI slow to load? Is there an easy way to make that part optional? My team loves that we can easily investigate model performance and feature importance by subgroup, but the load time is a major concern.

@imatiach-msft
Contributor

imatiach-msft commented Mar 16, 2022

@smrnvdhy
"which functionality/calculation is making the UI slow to load"
This is mostly due to loading all of the datapoints into the TypeScript (compiled to JavaScript) code; the browser is just not able to handle that much data. The solution seems to be to perform aggregations in the backend and load only summaries in the frontend, similar to how the ErrorAnalysisDashboard can handle a lot of data as long as a sample is passed for the dataset explorer (which may also be resolved in the future with similar aggregations or by streaming specific datapoints, depending on the scenario):

https://github.com/microsoft/responsible-ai-toolbox/blob/main/raiwidgets/tests/test_error_analysis_dashboard.py#L70

And also in the notebook (where the sample_dataset parameter doesn't need to be passed since the original explanation was downsampled):
https://github.com/microsoft/responsible-ai-toolbox/blob/main/notebooks/individual-dashboards/erroranalysis-dashboard/erroranalysis-interpretability-dashboard-census.ipynb
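
Roughly, the pattern from the linked test looks like the sketch below. This is my reading of the test, not a verified snippet; parameter names such as sample_dataset and true_y_dataset should be checked against the installed raiwidgets version.

```python
# Sketch based on my reading of the linked test: the explanation is computed on
# a sample, the full dataset is still available to the error analysis views,
# and only the sample is loaded into the dataset explorer.
from raiwidgets import ErrorAnalysisDashboard

ErrorAnalysisDashboard(
    global_explanation,            # explanation computed on the sampled rows
    model,
    dataset=X_test_full,           # full dataset for the tree/heatmap views
    true_y=y_test_sample,          # labels matching the explanation sample
    true_y_dataset=y_test_full,    # labels for the full dataset
    sample_dataset=X_test_sample)  # only this sample goes to the dataset explorer
```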

I even have a POC of the Error Analysis Dashboard running on Spark (millions of rows) for the tree view.
However, the Explanation Dashboard currently loads all of the data in the dataset explorer, so the browser simply starts to fail at some point. Browsers have much lower limits than the Python environment, and TypeScript code is also slower and less efficient. It's considered a bad pattern to have the frontend do heavy calculations anyway.

@imatiach-msft
Contributor

Closing issue as the PR has been merged:
#1271
