
Unit tests are slow #213

Open
ZedThree opened this issue Sep 1, 2021 · 9 comments

Comments

@ZedThree
Member

ZedThree commented Sep 1, 2021

The unit tests take really quite a long time to run. I used the pytest profiling extension, and it looks like creating and opening datasets takes the vast majority of the time.

Is there a lot of reading/writing to disk? Would it be possible to just do things in memory?
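A minimal sketch of what "just doing things in memory" could look like with xarray — `make_test_dataset` is a hypothetical stand-in for xBOUT's real input-creation helpers, not its actual API:

```python
# Sketch: build a small test Dataset entirely in memory, with no disk I/O.
# make_test_dataset is a hypothetical helper, not part of xBOUT's real API.
import numpy as np
import xarray as xr

def make_test_dataset(nx=4, ny=5):
    data = np.arange(nx * ny, dtype=float).reshape(nx, ny)
    return xr.Dataset({"n": (("x", "y"), data)})

ds = make_test_dataset()  # never touches the filesystem
```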

@johnomotani
Collaborator

I did implement creating-in-memory in #132. I thought it was being used in most places (there's a save-to-disk argument to the input-creation functions that defaults to not writing). It's possible we need to switch more of the tests over. But I think the combining-into-a-single-Dataset operation may be relatively expensive. The data is created with a lot of small variations on the inputs in different tests or when parameterizing - it might still be worth caching the created datasets, and maybe trying to reduce the number of variations, but it's probably quite a faff to do!
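One way the suggested caching could work is to memoize the creation function on its (hashable) parameters and hand each test its own copy — a sketch only, with `build_dataset` standing in for the real input-creation functions:

```python
# Sketch: memoize expensive dataset creation so each unique parameter
# combination is only built once per test session. build_dataset is a
# hypothetical stand-in for xBOUT's input-creation functions.
from functools import lru_cache

import numpy as np
import xarray as xr

@lru_cache(maxsize=None)
def build_dataset(nx, ny):
    data = np.random.default_rng(0).normal(size=(nx, ny))
    return xr.Dataset({"n": (("x", "y"), data)})

def get_dataset(nx=4, ny=5):
    # Give each test its own deep copy so mutations can't leak
    # between tests through the shared cached instance.
    return build_dataset(nx, ny).copy(deep=True)
```

The trade-off is exactly the one noted above: with many small variations on the inputs, each variation still gets built once, so reducing the number of variations matters as much as the caching itself.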

@johnomotani
Collaborator

I had an idea for caching the input data, to avoid re-creating it. Working on it now, PR hopefully coming soon...

@johnomotani
Collaborator

johnomotani commented Sep 7, 2021

Looking at how long different tests take, the biggest chunks of time are the parallel interpolation tests (which you'd expect to be computationally heavy because they have four FFT transforms as well as the interpolation stencil operation), and the Region tests and TestPlot::test_region_disconnecteddoublenull, which have a lot of different cases to cover. It might be worth someone taking a look at those two sets in particular: are there more cases than we really need? Could the tests be done with smaller arrays without compromising anything?

For TestPlot::test_region_disconnecteddoublenull specifically, it might be worth splitting into one test that includes all the plot functions currently included for one or a few combinations of input parameters, and a second that only uses one plot function (say pcolormesh) to test all the parameters.
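The suggested split could look roughly like this with pytest parametrization (the method list and the `guards` parameter are illustrative, not the real test signatures):

```python
# Sketch of splitting the big parametrized plot test in two:
# (1) every plot method, but only one representative parameter set;
# (2) the full parameter sweep, but only the cheapest plot method.
import pytest

PLOT_METHODS = ["pcolormesh", "contourf", "contour"]   # illustrative
GUARD_CASES = [True, False]                            # illustrative sweep

@pytest.mark.parametrize("method", PLOT_METHODS)
def test_all_plot_methods_one_case(method):
    # Would build one representative disconnected-double-null dataset
    # and exercise the given plot method on it.
    assert method in PLOT_METHODS

@pytest.mark.parametrize("guards", GUARD_CASES)
def test_pcolormesh_all_cases(guards):
    # Would sweep all the input parameters using only pcolormesh.
    assert guards in GUARD_CASES
```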

@ZedThree
Member Author

xarray 2022.10 causes a further slowdown of the tests -- deep copying seems to have become roughly 20x slower

See, e.g. #252 (comment)

> For TestPlot::test_region_disconnecteddoublenull specifically, it might be worth splitting into one test that includes all the plot functions currently included for one or a few combinations of input parameters, and a second that only uses one plot function (say pcolormesh) to test all the parameters.

Sounds good

@ZedThree
Member Author

The deep copying is more expensive because it's now copying DataArrays in attributes -- and the Regions have two or three nested levels of this. So ways of removing that nesting are probably worth investigating. Can the Regions make use of views somehow perhaps?
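The cost difference can be seen with a toy Dataset carrying a DataArray in its `attrs` (shapes and names are made up): with xarray >= 2022.10, a deep copy duplicates the nested array while a shallow copy shares it, and with Regions nesting two or three levels of DataArrays that duplication multiplies.

```python
# Toy demonstration of why DataArrays nested in .attrs make deep copies
# expensive: a deep copy duplicates the nested array, a shallow copy shares it.
import copy

import numpy as np
import xarray as xr

inner = xr.DataArray(np.zeros(1000))  # stand-in for a Region-attached array
ds = xr.Dataset({"n": ("x", np.zeros(10))}, attrs={"region_data": inner})

shallow = ds.copy(deep=False)   # shares the nested DataArray
deep = copy.deepcopy(ds)        # duplicates it, and everything inside it

assert shallow.attrs["region_data"] is inner
assert deep.attrs["region_data"] is not inner
```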

@johnomotani
Collaborator

hmm... wonder if we actually need to deep-copy at all? I suspect the idea at the time was that it would be safer, but maybe it's not needed...

@ZedThree
Member Author

It seems that prior to 2022.10, deep copying didn't deep copy the attributes (and therefore regions?), so presumably a shallow copy should be fine now?

I switched some deep copies to shallow ones and it did make a substantial difference to the one test I've been running, but it's still 10x slower than before.

@TomNicholas
Collaborator

Can I ask why xBOUT stores arrays of data in the attrs?

Whilst xarray's .attrs can technically store arbitrary objects, it's intended for storing small metadata (e.g. names of units, provenance, explanation of what variables represent etc.). Putting entire xr.DataArrays in the .attrs is not an intended use at all. If xBOUT didn't put actual data arrays in the attrs then the difference between deep- or shallow-copying metadata wouldn't matter.

@johnomotani
Collaborator

johnomotani commented Jan 20, 2023

The DataArrays in attrs were in the 'regions'. They were actually just a few scalars, so it probably makes sense to extract them with .values and store them as plain numbers.

This change is a 4-liner, and substantially reduces the number of errors we get from xarray-2022.9.0. I'll make a draft PR as a starting point for fixing #275.
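The fix described can be sketched like this (the variable names are invented for illustration; the real change lives in the Region code):

```python
# Sketch of the fix: replace 0-d scalar DataArrays stored in region
# attributes with plain Python numbers, extracted via .values.item().
import xarray as xr

edge = xr.DataArray(1.5)          # scalar previously stored as a DataArray
edge_value = edge.values.item()   # plain float: cheap to copy, no nesting

assert isinstance(edge_value, float)
assert edge_value == 1.5
```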
