Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve memory load efficiency for shape_availability calculation #243

Merged
merged 12 commits into from
Jun 14, 2022

Conversation

calvintr
Copy link
Contributor

@calvintr calvintr commented Jun 6, 2022

Changes proposed in this Pull Request

Improving the efficiency of shape_availibility() with respect to memory load through changes in dtypes and the method of mask summation.

Description

Two main changes are made to reduce the memory load while running shape_availability() within gis.py:

  • dtype transformations via np.astype() to int32 are removed on several occasions to keep matrices returned from functions as rasterio.features.geometry_mask() and scipy.ndimage.morphology.binary_dialation() in dtype bool.
    The dtype transformation to float64, applied to the final mask that is returned by the function, is removed.
  • Instead of saving individual masks for every exclusion raster or geometry to a list and summing them on the return, a new method is introduced, which adds the masks on every loop iteration. This is done via the | OR operator, to keep the single mask in memory as dtype bool.

Motivation and Context

With higher resolution rasters or greater land area covered, the underlying matrices, when calculating the eligible area via shape_availability(), quickly grow in size. E.g., a raster with 50 meter resolution bound by the shape of Germany produces an array of shape (17086, 12679). With the current method of storing an individual mask for every raster added to the ExclusionContainer() in dtype int32 this can lead to infeasible memory usage for conventional systems. The changes enhance the efficiency of storing information, as well the method of combining information form multiple rasters.

How Has This Been Tested?

So far tested locally with the latest version of atlite. Tests included rasters with buffers, inverted rasters as well as geometries.
pytest did not bring up unexpected errors.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I tested my contribution locally and it seems to work fine.
  • I locally ran pytest inside the repository and no unexpected problems came up.
  • I have adjusted the docstrings in the code appropriately.
  • I have documented the effects of my code changes in the documentation doc/.
  • I have added newly introduced dependencies to environment.yaml file.
  • I have added a note to release notes doc/release_notes.rst.
  • I have used pre-commit run --all to lint/format/check my contribution

Copy link
Contributor

@FabianHofmann FabianHofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great @calvintr, still need to test it. Some comments on the code below.

atlite/gis.py Outdated Show resolved Hide resolved
atlite/gis.py Outdated Show resolved Hide resolved
atlite/gis.py Outdated Show resolved Hide resolved
@calvintr calvintr force-pushed the fix-shape_avialability-efficiency branch from f3db671 to 283369f Compare June 8, 2022 15:55
@euronion
Copy link
Collaborator

euronion commented Jun 9, 2022

Running without and with this patch land availability investigation for Italy as an example:

  • One geometry layer (WDPA)
  • One raster layer (slope derived using GEBCO data)

Output of %mprun for both scenarios:
image

  • Lines 447, 467: Same change in memory usage when generating the masked array
  • Lines 464, 468: No memory increase from converting into a different datatype or temporarily storing the masked version in a list (in patched version)

Overall significantly lower memory footprint. Nice! I'll run a test with more rasters and geometries.

@euronion
Copy link
Collaborator

euronion commented Jun 9, 2022

Comparison with multiple rasters and geometries:

  • Notice the overall significantly lower Mem usage inside the function which continous to increase as more data is temporarily stored in the list exclusions (no-patch version left side)
  • Some peak Increment values when large datasets are processed in both, patched and unpatched, versions
  • Overall maximum Mem usage in patched version does not exceed ~4100 MiB´ despite adding multiple datasets. In the unpatched version the Mem usageexceeded14000 MiB` and would have grown even further if a larger raster/cutout or more datasets would have been selected

Looking very good @calvintr !

image

atlite/gis.py Show resolved Hide resolved
 - ensure coherent boolean type in shape_availability function
 - ensure integer type of mask in rasterio reproject function in shape_availability_reprojected
@euronion
Copy link
Collaborator

Running @FabianHofmann recent commit 31f9b0d (left) against the previous one c67bebd (centre) and one where we convert to float before returning the value (right):

image

Converting to float obviously increases the memory consumption, but only once at the end of the function. For the sake of backwards compatability I believe that is something we should do....

@FabianHofmann
Copy link
Contributor

What are the breaks you saw or have in mind? The follow-up functions run can work with it as well as plt.imshow (as used in the examples). I cannot think about any use case that is affected here. And from a conceptional point of view, boolean make more sense, just indicating whether it is eligible or not.

@calvintr
Copy link
Contributor Author

What are the breaks you saw or have in mind? The follow-up functions run can work with it as well as plt.imshow (as used in the examples). I cannot think about any use case that is affected here. And from a conceptional point of view, boolean make more sense, just indicating whether it is eligible or not.

I agree that returning the mask in boolean makes sense from a conceptional point of view.

Also, the transformation of the final mask on the return to float was what formerly caused the Memory Allocation Error for me. This should be much less likely now with the memory saving changes implemented. However, with the dtype changes and mask summation, the float transformation would now account for the biggest chunk of memory load apart from the projected_mask().

In my (limited) workflow I did not encounter any breaking errors with the boolean arrays in plotting or writing it to a new .tif file with GDAL. There is however one issue I ran into that might be relevant also for the examples in the documentation:

When summing the boolean array for large masks and multiplying it (to calculate the eligible area for example) this creates very large integers that lead to an overflow warning [RuntimeWarning: overflow encountered in long scalars] / wrong output for me, following the code example of atlite. It can be resolved with a dtype transformation on the summation. See in the attached screenshot, example is run with my commit returning the mask as dtype bool:

grafik

@euronion
Copy link
Collaborator

I get your arguments.

@calvintr Your issue is related to what my issue with changing to bool as dtype is, see here: https://stackoverflow.com/a/41705902 for the reason you presumably get the issue (tl;dr: numpy on Windows uses by default int32 in some cases).

My issue is how this feature might usually be used in regular code:
The masked array is usually used for further calculations and modificiations. Since we are changing the dtype we are also changing on how it behaves.

Take e.g. assigning NaN to masked where there are 0s:

masked[masked == 0] = np.nan

for a dtype float, that code works without issues.
However if the dtype of masked is bool, this will result in an array full of True, as np.nan is automatically casted to a boolean value and the truth value of np.nan is True.

Example:

image

We can:

  • Add a FutureWarning for users and change the return type in the 2nd-next version
  • Add a Warning about the changed dtype in the function, release notes (always) and change it right away
  • Don't change the return type and accept the fact that the casting is memory intensive

We should:
Defintely have a test which catches these types of changes. Is there something that Python offers which allows for simple checking if function return types / signatures change e.g. using pytest?
I only caught this issue by chance and I would have preferred it to be notified about it automatically by the CI :/

@FabianHofmann
Copy link
Contributor

Thinking about a middle ground where the return dtype is an int (preferably int8). Then, we do not encounter the warning posted by @calvintr and numerical operation should well-behave. The nan assignment would still lead to a weird behavior. However, I cannot think about an use-case for having nan's in a mask (what should these express? missing data? already given by nodata which is a defined int value.)
I'd go for option 2 of @euronion since it is not only solving the memory issue also speeding up the process in pypsa-eur and I don't think this is anything severe.

@euronion
Copy link
Collaborator

I don't see any benefits of the middle ground solution, more like only downsides:

  • It explicitly is meant to fix @calvintr problem
  • It breaks the nan issue, as nan cannot be represented by int arrays:
np.int8(np.nan)
ValueError: cannot convert float NaN to integer

(btw.: nodata is not directly Python but a predefined value via rasterio in our case. int doesn't have a nodata value)

  • It softens up the idea of returning True and False for the map

Based on that I think going with option 2 might be the better one to go with.

@FabianHofmann
Copy link
Contributor

Alright, then let's go for pure option 2. Just for the background of the "middle ground" option: For the availability matrix computation, the boolean masks have to be transformed to int arrays anyway, right before passing it to rasterio reproject. Meaning, the conversion would not lead to a any memory overhead and the effective change of this PR would be changing the output dtype of shape_availability from float to int instead of float to bool. On the same time, it would solve @calvintr 's warning. But that said, I prefer having a boolean output anyway.
So, I'd run final tests on the pypsa-eur workflow. @calvintr could you add a warning to the shape_availibility function and update the release notes? Then, we'd be ready to merge.

@FabianHofmann
Copy link
Contributor

Profiles and availabilities in pypsa-eur are the same (tested with rtol=1e-05, atol=1e-08). So good to go

RELEASE_NOTES.rst Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants