Improve memory load efficiency for shape_availability calculation #243

calvintr · 2022-06-06T20:46:12Z

Changes proposed in this Pull Request

Improving the memory efficiency of shape_availability() through changes in dtypes and in the method of mask summation.

Description

Two main changes are made to reduce the memory load while running shape_availability() within gis.py:

dtype transformations to int32 via .astype() are removed in several places, so that matrices returned from functions such as rasterio.features.geometry_mask() and scipy.ndimage.morphology.binary_dilation() stay in dtype bool.
The dtype transformation to float64, applied to the final mask that is returned by the function, is removed.
Instead of saving an individual mask for every exclusion raster or geometry to a list and summing them on return, the masks are now combined on every loop iteration. This is done via the | (OR) operator, which keeps a single mask in memory as dtype bool.
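
A minimal sketch of the two summation strategies described above, using small random arrays as stand-ins for the exclusion masks. The variable names and the exact form of the previous return expression are assumptions for illustration, not the code in gis.py:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 5)
# Stand-ins for the per-layer exclusion masks (True = excluded).
layer_masks = [rng.random(shape) > 0.7 for _ in range(3)]

# Previous approach (assumed form): keep every mask as int32 in a list,
# sum the list at the end and convert the result to float64.
stored = [m.astype(np.int32) for m in layer_masks]
eligible_old = (sum(stored) == 0).astype(np.float64)

# New approach: fold each mask into a single boolean array as soon as it
# is produced, so only one mask-sized array stays in memory.
excluded = np.zeros(shape, dtype=bool)
for m in layer_masks:
    excluded |= m
eligible_new = ~excluded
```

Both variants mark the same cells as eligible; only the dtype and the number of arrays held in memory differ.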

Motivation and Context

With higher resolution rasters or greater land area covered, the underlying matrices used to calculate the eligible area via shape_availability() quickly grow in size. E.g., a raster with 50 meter resolution bound by the shape of Germany produces an array of shape (17086, 12679). With the current method of storing an individual mask in dtype int32 for every raster added to the ExclusionContainer(), this can lead to infeasible memory usage on conventional systems. The changes make storing the information more efficient, as well as the method of combining information from multiple rasters.
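
As a rough back-of-the-envelope check, the size of a single array of that shape can be computed per dtype without allocating it. The helper name below is made up for this sketch:

```python
import numpy as np

shape = (17086, 12679)  # array shape quoted for the Germany example
n_cells = shape[0] * shape[1]


def size_in_mib(dtype):
    """Size of one array of `shape` in MiB for the given dtype."""
    return n_cells * np.dtype(dtype).itemsize / 2**20


sizes = {name: size_in_mib(name) for name in ("bool", "int32", "float64")}
for name, mib in sizes.items():
    print(f"{name:>8}: {mib:8.1f} MiB")
```

A bool mask takes one byte per cell, so an int32 copy of the same mask is four times larger and a float64 copy eight times larger. With one stored mask per exclusion layer, the difference scales with the number of layers.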

How Has This Been Tested?

So far tested locally with the latest version of atlite. Tests included rasters with buffers, inverted rasters as well as geometries.
pytest did not bring up unexpected errors.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist

I tested my contribution locally and it seems to work fine.
I locally ran pytest inside the repository and no unexpected problems came up.
I have adjusted the docstrings in the code appropriately.
I have documented the effects of my code changes in the documentation doc/.
I have added newly introduced dependencies to environment.yaml file.
I have added a note to release notes doc/release_notes.rst.
I have used pre-commit run --all to lint/format/check my contribution

FabianHofmann

Great @calvintr, still need to test it. Some comments on the code below.

atlite/gis.py

euronion · 2022-06-09T15:28:53Z

Running a land availability investigation for Italy, without and with this patch, as an example:

One geometry layer (WDPA)
One raster layer (slope derived using GEBCO data)

Output of %mprun for both scenarios:

Lines 447, 467: Same change in memory usage when generating the masked array
Lines 464, 468: No memory increase from converting into a different datatype or temporarily storing the masked version in a list (in patched version)

Overall significantly lower memory footprint. Nice! I'll run a test with more rasters and geometries.

euronion · 2022-06-09T15:55:39Z

Comparison with multiple rasters and geometries:

Notice the overall significantly lower Mem usage inside the function. In the unpatched version (left side) it continues to increase as more data is temporarily stored in the list exclusions.
Some peak Increment values occur when large datasets are processed, in both the patched and the unpatched version.
The overall maximum Mem usage in the patched version does not exceed ~4100 MiB despite adding multiple datasets. In the unpatched version the Mem usage exceeded 14000 MiB and would have grown even further if a larger raster/cutout or more datasets had been selected.

Looking very good @calvintr !

euronion · 2022-06-10T12:32:28Z

Running @FabianHofmann's recent commit 31f9b0d (left) against the previous one c67bebd (centre) and one where we convert to float before returning the value (right):

Converting to float obviously increases the memory consumption, but only once at the end of the function. For the sake of backwards compatibility I believe that is something we should do.

FabianHofmann · 2022-06-11T10:28:59Z

What are the breaks you saw or have in mind? The follow-up functions run and work with it, as does plt.imshow (as used in the examples). I cannot think of any use case that is affected here. And from a conceptual point of view, booleans make more sense, just indicating whether a cell is eligible or not.

calvintr · 2022-06-11T12:04:55Z

What are the breaks you saw or have in mind? The follow-up functions run and work with it, as does plt.imshow (as used in the examples). I cannot think of any use case that is affected here. And from a conceptual point of view, booleans make more sense, just indicating whether a cell is eligible or not.

I agree that returning the mask as boolean makes sense from a conceptual point of view.

Also, the transformation of the final mask on the return to float was what formerly caused the Memory Allocation Error for me. This should be much less likely now with the memory saving changes implemented. However, with the dtype changes and mask summation, the float transformation would now account for the biggest chunk of memory load apart from the projected_mask().

In my (limited) workflow I did not encounter any breaking errors with the boolean arrays in plotting or writing it to a new .tif file with GDAL. There is however one issue I ran into that might be relevant also for the examples in the documentation:

When summing the boolean array for large masks and multiplying the result (to calculate the eligible area, for example), this creates very large integers that lead to an overflow warning [RuntimeWarning: overflow encountered in long scalars] and wrong output for me, following the code example of atlite. It can be resolved with a dtype transformation on the summation. See the attached screenshot; the example is run with my commit returning the mask as dtype bool:
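
A minimal sketch of the dtype transformation on the summation, using a small all-True stand-in mask and an assumed 50 m cell size. The variable names are illustrative only:

```python
import numpy as np

mask = np.ones((1000, 1000), dtype=bool)  # small stand-in for a large mask
cell_area = 50 * 50  # m^2 per cell, assuming 50 m resolution

# Requesting a 64-bit accumulator keeps the count and the product below
# out of 32-bit range problems, whatever the platform's default integer is.
eligible_cells = mask.sum(dtype=np.int64)
eligible_area_m2 = eligible_cells * cell_area

# Alternatively, accumulate in float64 and convert to km^2 directly.
eligible_area_km2 = mask.sum(dtype=np.float64) * cell_area / 1e6

print(eligible_area_m2, np.iinfo(np.int32).max)
```

Even this small stand-in yields an area in m^2 that is larger than the biggest value a 32-bit integer can hold, which is why the product can overflow if the default integer type is 32 bits wide.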

euronion · 2022-06-13T07:04:12Z

I get your arguments.

@calvintr Your issue is related to my concern with changing the dtype to bool; see https://stackoverflow.com/a/41705902 for the presumable reason you get the issue (tl;dr: numpy on Windows uses int32 by default in some cases).

My issue is how this feature might usually be used in regular code:
The masked array is usually used for further calculations and modifications. Since we are changing the dtype, we are also changing how it behaves.

Take e.g. assigning NaN to masked where there are 0s:

masked[masked == 0] = np.nan

For a float dtype, that code works without issues.
However, if the dtype of masked is bool, this results in an array full of True, as np.nan is automatically cast to a boolean value and the truth value of np.nan is True.

Example:
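
The behaviour can be reproduced with a small sketch along these lines (toy values, not an actual availability mask):

```python
import numpy as np

values = [1, 0, 1, 0]

# float mask: the zeros are replaced by NaN as intended
as_float = np.array(values, dtype=float)
as_float[as_float == 0] = np.nan

# bool mask: NaN is cast to True on assignment, so every entry ends up True
as_bool = np.array(values, dtype=bool)
as_bool[as_bool == 0] = np.nan

print(as_float)
print(as_bool)
```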

We can:

Add a FutureWarning for users and change the return type in the 2nd-next version
Add a Warning about the changed dtype in the function, release notes (always) and change it right away
Don't change the return type and accept the fact that the casting is memory intensive

We should:
Definitely have a test which catches these types of changes. Is there something that Python offers which allows for simple checking if function return types / signatures change, e.g. using pytest?
I only caught this issue by chance and I would have preferred to be notified about it automatically by the CI :/
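
One simple option is a plain pytest test that pins the dtype of the returned mask. The sketch below uses a hypothetical stand-in function rather than the real shape_availability(), whose arguments are not reproduced here:

```python
import numpy as np


def availability_stub():
    """Hypothetical stand-in for the function under test."""
    excluded = np.array([[True, False], [False, False]])
    return ~excluded


def test_mask_dtype_is_bool():
    masked = availability_stub()
    # Fails in CI as soon as the return dtype changes, e.g. back to float.
    assert masked.dtype == np.bool_
    assert masked.ndim == 2
```

In a real test the stub would be replaced by a call to the function with a small test geometry, so a changed return type shows up as a failing test instead of being noticed by chance.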

FabianHofmann · 2022-06-13T09:25:08Z

Thinking about a middle ground where the return dtype is an int (preferably int8). Then we do not encounter the warning posted by @calvintr and numerical operations should behave well. The nan assignment would still lead to weird behavior. However, I cannot think of a use case for having nans in a mask (what should these express? missing data? That is already given by nodata, which is a defined int value.)
I'd go for option 2 of @euronion, since it not only solves the memory issue but also speeds up the process in pypsa-eur, and I don't think this is anything severe.

euronion · 2022-06-13T09:38:45Z

I don't see any benefits of the middle ground solution, more like only downsides:

It is explicitly meant to fix @calvintr's problem
It breaks the nan issue, as nan cannot be represented by int arrays:

np.int8(np.nan)
ValueError: cannot convert float NaN to integer

(btw.: nodata is not directly Python but a predefined value via rasterio in our case. int doesn't have a nodata value)

It softens up the idea of returning True and False for the map

Based on that I think going with option 2 might be the better one to go with.

FabianHofmann · 2022-06-13T10:11:18Z

Alright, then let's go for pure option 2. Just for the background of the "middle ground" option: for the availability matrix computation, the boolean masks have to be transformed to int arrays anyway, right before passing them to rasterio reproject. Meaning, the conversion would not lead to any memory overhead, and the effective change of this PR would be changing the output dtype of shape_availability from float to int instead of float to bool. At the same time, it would solve @calvintr's warning. But that said, I prefer having a boolean output anyway.
So, I'd run final tests on the pypsa-eur workflow. @calvintr could you add a warning to the shape_availability function and update the release notes? Then we'd be ready to merge.
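
A minimal sketch of what such a warning could look like, using a hypothetical stand-in function. The message text and function name are assumptions, not the wording that ended up in atlite:

```python
import warnings

import numpy as np


def availability_stub(excluded):
    """Hypothetical stand-in for a function whose return dtype changed."""
    warnings.warn(
        "The returned mask is now of dtype bool instead of float.",
        UserWarning,
        stacklevel=2,
    )
    return ~excluded


excluded = np.array([True, False, False])

# Callers that already handle boolean masks can silence the warning locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    masked = availability_stub(excluded)
```

Internal callers can wrap the call in such an ignore context, while users calling the function directly still see the notice.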

FabianHofmann · 2022-06-13T11:56:16Z

Profiles and availabilities in pypsa-eur are the same (tested with rtol=1e-05, atol=1e-08). So good to go
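
For reference, a comparison with these tolerances can be sketched with np.allclose. The arrays below are made-up stand-ins, not the pypsa-eur results:

```python
import numpy as np

# Stand-ins for a result computed before and after the patch.
before = np.array([0.25, 0.50, 0.75])
after = before + 1e-9  # tiny numerical difference

# Elementwise check: |after - before| <= atol + rtol * |before|
same = np.allclose(after, before, rtol=1e-05, atol=1e-08)
print(same)
```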

calvintr added 2 commits June 4, 2022 15:22

Implementing inital efficiency improvements to shape_availability fun…

9edea4e

…ction

Changing matrix summation to boolean operation

c94e362

FabianHofmann requested changes Jun 7, 2022

atlite/gis.py (3 review threads, outdated and resolved)

calvintr and others added 2 commits June 8, 2022 17:31

Update return of final mask using NOT operator

02fa479

Co-authored-by: Fabian Hofmann <[email protected]>

Removing enumerate from iterable in response to comments

283369f

calvintr force-pushed the fix-shape_avialability-efficiency branch from f3db671 to 283369f Compare June 8, 2022 15:55

calvintr and others added 3 commits June 8, 2022 18:06

Fix wrong iterable

17c2fbc

gis: remove obsolete exclusion initialization

34a2de7

Merge branch 'fix-shape_avialability-efficiency' of github.com:calvin…

c67bebd

…tr/atlite into pr/calvintr/243

euronion reviewed Jun 9, 2022

atlite/gis.py (resolved)

gis:

31f9b0d

- ensure coherent boolean type in shape_availability function
- ensure integer type of mask in rasterio reproject function in shape_availability_reprojected

calvintr and others added 2 commits June 13, 2022 20:00

Adding information of changed function output as Breaking Change to r…

d330a12

…elease notes

Merge branch 'master' into fix-shape_avialability-efficiency

8c62ad5

FabianHofmann reviewed Jun 14, 2022

RELEASE_NOTES.rst (outdated, resolved)

Fabian Hofmann and others added 2 commits June 14, 2022 14:17

Update RELEASE_NOTES.rst

3ffdac0

gis: add Userwarnings and ignore context

a3d547f

FabianHofmann approved these changes Jun 14, 2022

FabianHofmann merged commit 607f866 into PyPSA:master Jun 14, 2022

pz-max mentioned this pull request Dec 25, 2022

Move PyPSA dependency back to the main repository pypsa-meets-earth/pypsa-earth#538

Merged

7 tasks
