NetCDF chunking using extra memory with Iris 2.4 #4107

Closed
alastair-gemmell opened this issue Apr 26, 2021 · 11 comments

@alastair-gemmell (Contributor)

On behalf of Scitools user(s):

I have an Iris/Dask/NetCDF question about chunking data and collapsing dimensions, in this case to compute ensemble statistics (ensemble mean/median/standard deviation fields).
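For context, the operation in question looks roughly like the following sketch (hypothetical file path and coordinate name, not the reporter's actual script); the collapse stays lazy until the result is realised:

import iris
import iris.analysis

# Load all ensemble members as a single cube (placeholder glob).
cube = iris.load_cube('[path]/ensemble_members_*.nc')
print(cube.lazy_data().chunks)  # how Iris/Dask chunked the data on load

# Collapse over the ensemble dimension lazily; touching .data (or saving)
# triggers the actual reads and computation.
mean_cube = cube.collapsed('realization', iris.analysis.MEAN)
std_cube = cube.collapsed('realization', iris.analysis.STD_DEV)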

The NetCDF files that I'm using have chunking that is optimized for this operation. Rather than using the NetCDF file's chunking, Iris/Dask is putting each ensemble member into its own chunk. I think this is resulting in my code trying to read all of the data into memory at once. Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array?

Also, the memory requirement for this and other operations seems to have gone up noticeably when moving from Iris 2.2 to Iris 2.4, and I've been busy doubling my memory allocations to allow for it. Running on Iris 2.2, this task took a bit over an hour to compute with 16 GB of RAM allocated; with Iris 2.4 it no longer fits into that memory or completes within a 2-hour job.

To recreate issue:

First, I created two conda environments, one with Iris v2.2 and one with Iris v2.4:
conda create --name iris-v2.2 -c conda-forge python=3.7 iris=2.2 memory_profiler
conda create --name iris-v2.4 -c conda-forge python=3.7 iris=2.4 memory_profiler

Then, after activating each environment, I ran (python code below):
python -m mprof run dask_issue.py
mprof plot

For Iris v2.2:
Loading the data:
Has lazy data: True
Chunksize: (1, 1, 36, 72)

Extracting a single year:
Has lazy data: True
Chunksize: (1, 1, 36, 72)

Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)

For Iris v2.4:
Loading the data:
Has lazy data: True
Chunksize: (1, 2028, 36, 72)

Extracting a single year:
Has lazy data: True
Chunksize: (1, 12, 36, 72)

Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)

The plots produced by memory_profiler show that the memory usage when realising the extracted data with Iris v2.4 is over 5 times the memory usage with Iris v2.2. The time taken to realise the data also increases. The cube prior to extraction contains data from 1850 to 2018. Is it possible that all the data (rather than just the year that was extracted) are being realised when using Iris v2.4?

dask_issue.py code referenced above:

#!/usr/bin/env python
import iris

def main():
    print('Loading the data:')
    cube = load_cube()
    print_cube_info(cube)

    print('Extracting a single year:')
    cube = extract_year(cube)
    print_cube_info(cube)

    print('Realising the data of the single year:')
    realise_data(cube)
    print_cube_info(cube)


@profile
def load_cube():
    filename = ('[path]/HadSST4/analysis/HadCRUT.5.0.0.0.SST.analysis.anomalies.?.nc')
    cube = iris.load_cube(filename)
    return cube


@profile
def extract_year(cube):
    year = 1999
    time_constraint = iris.Constraint(
        time=lambda cell: cell.point.year == year)
    cube = cube.extract(time_constraint)
    return cube


@profile
def realise_data(cube):
    cube.data


def print_cube_info(cube):
    tab = ' ' * 4
    print(f'{tab}Has lazy data: {cube.has_lazy_data()}')
    print(f'{tab}Chunksize: {cube.lazy_data().chunksize}')


if __name__ == '__main__':
    main()
@DPeterK (Member) commented Apr 26, 2021

Looks like NetCDF chunk handling was changed in Iris between v2.2 and v2.4, here: #3361. That might be a good place to start looking; it's possible that that change had an uncaught consequence, and this is what's reported here.

For the sake of the original reporter...

Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array?

Iris already does - this was introduced in #3131.

@alastair-gemmell have you checked what the behaviour is in Iris v3?

Something else that would be interesting to know is what NetCDF reports the dataset chunking as for these problem files. For example:

import netCDF4
ds = netCDF4.Dataset("/path/to/my-file.nc")
print(ds.variables["my-data-var-name"].chunking())
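For comparison, the Dask chunking that Iris chose can be printed in the same way the reporter's script does (placeholder path):

import iris
cube = iris.load_cube("/path/to/my-file.nc")
print(cube.lazy_data().chunksize)  # compare against the file chunking reported by netCDF4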

@trexfeathers (Contributor)

Looks like NetCDF chunk handling was changed in Iris between v2.2 and v2.4, here: #3361. That might be a good place to start looking; it's possible that that change had an uncaught consequence, and this is what's reported here.

I did some cursory analysis on this problem and I can confirm it stems from the changes made in #3361 - if I substitute the v2.2.0 version of _lazy_data.py into the latest Iris, the v2.2 behaviour described above is replicated.

@cpelley commented Oct 11, 2021

I suspect we (IMPROVER) are also affected by this issue; see metoppv/improver#1579.
We are seeing an order of magnitude slowdown in some cases due to the unnecessary reading of data :(

@cpelley commented Oct 12, 2021

I suspect we (IMPROVER) are also affected by this issue...

Confirmed: this issue is what is affecting us - substituting as_lazy_data as per #4107 (comment) results in considerable performance improvements.

@trexfeathers (Contributor)

This is the most frustrating sort of software issue 😆

We made the change to fix known performance issues for some users. Many other users had been silent because performance was acceptable for them, but now they are the ones experiencing performance issues.

For us to have been aware of this before making the change, we would have needed a very comprehensive benchmark suite; unfortunately that wasn't in place at the time, nor is there currently the resource to develop one.

The potential solution I'm aware of that offers acceptable performance for everyone is much greater user configurability of chunking, but that doesn't appear to be a simple thing to deliver (#3333).
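For anyone following along, one coarse control that already exists is Dask's global chunk-size target, which recent Iris versions consult when deciding how far to combine file chunks into Dask chunks. A minimal sketch (illustrative value and placeholder path; not a tailored fix for this issue):

import dask.config
import iris

# Lower Dask's target chunk size before loading; recent Iris versions use this
# target when deciding how many file chunks to combine into each Dask chunk.
dask.config.set({'array.chunk-size': '32MiB'})
cube = iris.load_cube('/path/to/file.nc')
print(cube.lazy_data().chunksize)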

@cpelley commented Oct 12, 2021

Thanks for getting back to me on this @trexfeathers, appreciated.

The potential solution I'm aware of that offers acceptable performance for everyone is much greater user configurability of chunking, but that doesn't appear to be a simple thing to deliver (#3333).

Yes. I think this has always been my issue with dask - necessarily having to expose such low-level parameters to end users :(
It seems dask expects you to know ahead of time what you will be giving it (what form it takes) and what you will be doing with it.

Is there no possibility (a development path in dask) of dask arrays being made to genuinely re-chunk in place? (They are currently immutable.)

FYI @benfitzpatrick, @BelligerG

@trexfeathers (Contributor) commented Oct 12, 2021

Is there no possibility (a development path in dask) of dask arrays being made to genuinely re-chunk in place? (They are currently immutable.)

@cpelley I had thought the root of the problem was the shape of re-chunking. Without knowing the 'correct' chunking shape, wouldn't the quoted idea still manifest the same problem?

@pp-mo (Member) commented Feb 9, 2022

Possibly some progress: #4572 aims to address #3333.

@cpelley: Is there no possibility (a development path in dask) of dask arrays being made to genuinely re-chunk in place? (They are currently immutable.)

I really don't think this is likely.
Chunking as a static, ahead-of-time decision is deeply wired into Dask because, I think, its concept of a graph requires that chunks behave like 'delayed' objects -- i.e. they are graph elements representing a black-box task with a single result. So the chunks make the graph, and different chunks mean a new graph.
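To illustrate the point: rechunking a Dask array always produces a new array backed by a new graph; the original array and its graph are untouched. A small standalone sketch:

import dask.array as da

x = da.zeros((9, 2028, 36, 72), chunks=(1, 2028, 36, 72))
y = x.rechunk((1, 12, 36, 72))  # returns a new array with a new graph; x is unchanged

print(x.chunksize)  # (1, 2028, 36, 72)
print(y.chunksize)  # (1, 12, 36, 72)
print(len(dict(x.__dask_graph__())), len(dict(y.__dask_graph__())))  # y's graph contains many more tasks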

@trexfeathers (Contributor)

@ehogan & @cpmorice, you were the ones who originally brought this to our attention - thanks very much for your vigilance!

We have an open internal support ticket for it, but this has since moved well beyond support - we diagnosed the precise cause, and as you can see here it's difficult to find a resolution that gives everyone acceptable performance.

I'm therefore closing the internal support ticket. I wanted to tag you both here so that you can monitor our latest thoughts on the issue, should you wish.

@github-actions (bot)

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.

github-actions bot added the Stale label on Jan 28, 2024.
@trexfeathers (Contributor)

Gonna keep this open until the release of 3.8 (#5363), since #5588 should allow users to avoid this problem. Would be good to know if this works in action.
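If #5588 is the netCDF chunk-control work, then once 3.8 is out, opting back in to the file's own chunking might look roughly like this (API names recalled from memory, so treat them as an assumption and check the 3.8 release notes):

import iris
from iris.fileformats.netcdf.loader import CHUNK_CONTROL  # assumed module path and name

# Assumed behaviour: within this context, Iris chunks each variable as it is
# chunked in the file, instead of applying its own chunking heuristic.
with CHUNK_CONTROL.from_file():
    cube = iris.load_cube('/path/to/file.nc')  # placeholder path
print(cube.lazy_data().chunksize)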

trexfeathers removed the Stale label on Jan 29, 2024.