Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_plot_cdfs test is extremely slow #298

Closed
EwoutH opened this issue Oct 29, 2023 · 9 comments · Fixed by #306
Closed

test_plot_cdfs test is extremely slow #298

EwoutH opened this issue Oct 29, 2023 · 9 comments · Fixed by #306

Comments

@EwoutH
Copy link
Collaborator

EwoutH commented Oct 29, 2023

This test takes extremely long, in the order of minutes. I wouldn't be surprised if it's over half of all testing time.

class Test(unittest.TestCase):
def test_plot_cdfs(self):
x, outcomes = utilities.load_flu_data()
y = outcomes["deceased population region 1"][:, -1] > 1000000
regional_sa.plot_cdfs(x, y)
regional_sa.plot_cdfs(x, y, ccdf=True)
x = x.drop("scenario", axis=1)
regional_sa.plot_cdfs(x, y, ccdf=True)

@quaquel
Copy link
Owner

quaquel commented Oct 29, 2023

Should be easy to replace the data with a smaller dataset.

@EwoutH
Copy link
Collaborator Author

EwoutH commented Oct 29, 2023

Performance profile:

Screenshot_201
Screenshot_202

Looks like matplotlib's autoscale_view is called a lot of times, which seems to be the most expensive operation.

For the test itself, smaller dataset would indeed work.

@quaquel
Copy link
Owner

quaquel commented Oct 29, 2023

No idea why autoscale_view is called so often. There are 63 calls to plot_individual_cdf, so that is close 400 calls to autoscale per cdf (this is not the only place where autoscale might be called, but likely the dominant one). It's also strange to autoscale because at least the x-axis has a clearly defined hard-coded limit. It seems this arises from somewhere deep within matplotlib.

@EwoutH
Copy link
Collaborator Author

EwoutH commented Oct 29, 2023

Here the tree:

EMAworkbench_plot_cdf_profile

It seems that from plot_discrete_cdf (called 6 times) inner is the first function that's called that 24024 times, all the way to autoscale_view.

@quaquel
Copy link
Owner

quaquel commented Oct 29, 2023

The calls to scatter in plot_discrete_cdf might be redundant and could be replaced by using marker in the plot command. Would need some checking, however.

@quaquel
Copy link
Owner

quaquel commented Oct 31, 2023

Just ran a quick test. With scatter enabled timeit gives

52 s ± 732 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With scatter disabled, and replaced with the marker kwarg in plot we get

621 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, disabling scatter gives us orders of magnitude improvement, and no discernable difference in the visual.

@EwoutH
Copy link
Collaborator Author

EwoutH commented Oct 31, 2023

That's 2 orders of magnitude, or a 99% reduction in runtime. Seems worth it.

Are there any figures where the scatter is likely required to provide a correct/useful visual? If so we should test those.

@quaquel
Copy link
Owner

quaquel commented Oct 31, 2023

I did a visual comparison. It is really marginal.

@EwoutH
Copy link
Collaborator Author

EwoutH commented Oct 31, 2023

Then the speedup sounds worth it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants