NetCDF chunking using extra memory with Iris 2.4 #4107
Looks like NetCDF chunk handling was changed in Iris between v2.2 and v2.4, here: #3361. That might be a good place to start looking; it's possible that that change had an uncaught consequence, and this is what's reported here. For the sake of the original reporter...
Iris already does - this was introduced in #3131. @alastair-gemmell have you checked what the behaviour is in Iris v3? Something else that would be interesting to know is what NetCDF reports the dataset chunking as for these problem files. For example:
import netCDF4
ds = netCDF4.Dataset("/path/to/my-file.nc")
print(ds.variables["my-data-var-name"].chunking())
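(An aside, not stated in the thread: Variable.chunking() returns the string 'contiguous' for an unchunked variable, and otherwise a list with the chunk length for each dimension, so its output can be compared directly against the Dask chunksizes reported later in this issue.)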
I did some cursory analysis on this problem and I can confirm it stems from the changes made in #3361 - if I substitute the v2.2.0 version of _lazy_data.py into the latest Iris, the behaviour replicates the v2.2 behaviour described.
I suspect we (IMPROVER) are also affected by this issue; see metoppv/improver#1579.
Confirmed this issue is what is affecting us - substituting ...
This is the most frustrating sort of software issue 😆 We made the change to fix known performance issues for some users. Many other users were silent before, because performance was acceptable for them, but now they are the ones experiencing performance issues. For us to have been aware before making the change, we would have needed a very comprehensive benchmark suite, and unfortunately that wasn't in place at the time, nor is there resource to develop it currently. The potential solution I'm aware of that offers acceptable performance for everyone is much greater user configurability of chunking, but that doesn't appear to be a simple thing to deliver (#3333).
Thanks for getting back to me on this @trexfeathers, appreciated.
Yes. I think this has always been my issue with dask - necessarily having to expose such low-level parameters to end users :( Is there no possibility (a development path in dask) where dask arrays could be made to genuinely re-chunk the original dask array? (They are currently immutable.)
@cpelley I had thought the root of the problem was the shape of the re-chunking. Without knowing the 'correct' chunking shape, wouldn't the quoted idea still manifest the same problem?
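(Illustration, not from the thread: a minimal dask sketch of why rechunk() on an existing array cannot fix this on its own. It returns a new array whose graph is layered on top of the old chunks, so the original chunks are still what gets read and materialised. The shapes mirror the v2.4 chunking reported below.)
import dask.array as da
# One chunk spanning the whole time axis, like the (1, 2028, 36, 72) case below.
x = da.zeros((2028, 36, 72), chunks=(2028, 36, 72))
# rechunk() produces a NEW array with the requested chunking...
y = x.rechunk((12, 36, 72))
print(y.chunksize)  # (12, 36, 72)
# ...but computing even a small slice of y still materialises the
# original single chunk first, so peak memory does not improve.
y[:12].compute()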
Possibly some progress: #4572 aims to address #3333
I really don't think this is likely.
@ehogan & @cpmorice you were the ones that originally brought this to our attention, thanks very much for your vigilance! We have an open internal support ticket for it, but this has since moved well beyond support - we diagnosed the precise cause, and as you can see here it's difficult to find a resolution that gives everyone acceptable performance. I'm therefore closing the internal support ticket. I wanted to tag you both here so that you can monitor our latest thoughts on the issue, should you wish.
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity. If this issue is still important to you, then please comment on this issue and the stale label will be removed. Otherwise this issue will be automatically closed in 28 days' time.
On behalf of Scitools user(s):
I have an Iris/Dask/NetCDF question about chunking data and collapsing dimensions, in this case to compute ensemble statistics (ensemble mean/median/standard deviation fields).
The NetCDF files that I'm using have chunking that is optimized for this operation. Rather than using the NetCDF file's chunking, Iris/Dask is putting each ensemble member into its own chunk. I think this is resulting in my code trying to read all of the data into memory at once. Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array?
Also, the memory requirement for this and other operations seems to have gone up noticeably when moving from Iris 2.2 to Iris 2.4; I've been busy doubling my memory allocations to allow for it. Running on Iris 2.2, this task took a bit over an hour to compute with 16GB RAM allocated. With Iris 2.4 it no longer fits into that memory or within a 2-hour job.
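(A possible workaround sketch, not an Iris feature and not proposed in the thread: read the variable with netCDF4 and wrap it in a dask array using the file's own chunking, bypassing Iris's chunking heuristic. The path and variable name are placeholders, and this forgoes Iris's coordinate and metadata handling.)
import dask.array as da
import netCDF4

ds = netCDF4.Dataset('/path/to/my-file.nc')     # placeholder path
var = ds.variables['my-data-var-name']          # placeholder name
chunks = var.chunking()                         # the file's own chunk lengths
if chunks == 'contiguous':                      # unchunked variables report this
    chunks = var.shape
# lock=True serialises reads, since netCDF4 is not thread-safe.
lazy = da.from_array(var, chunks=tuple(chunks), lock=True)
# 'lazy' can now be reduced lazily, e.g. lazy.mean(axis=0).compute()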
To recreate the issue:
First, I created two Conda installations, one with Iris v2.2 and one with Iris v2.4:
conda create --name iris-v2.2 -c conda-forge python=3.7 iris=2.2 memory_profiler
conda create --name iris-v2.4 -c conda-forge python=3.7 iris=2.4 memory_profiler
Then, after activating each environment, I ran (Python code below):
python -m mprof run dask_issue.py
mprof plot
For Iris v2.2:
Loading the data:
Has lazy data: True
Chunksize: (1, 1, 36, 72)
Extracting a single year:
Has lazy data: True
Chunksize: (1, 1, 36, 72)
Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)
For Iris v2.4:
Loading the data:
Has lazy data: True
Chunksize: (1, 2028, 36, 72)
Extracting a single year:
Has lazy data: True
Chunksize: (1, 12, 36, 72)
Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)
The plots produced by memory_profiler show that the memory usage when realising extracted data using Iris v2.4 is over 5 times the memory usage when realising extracted data using Iris v2.2. The time taken to realise the data also increases. The cube prior to extraction contains data from 1850 to 2018. Is it possible that all the data (rather than just the year that had been extracted) are being realised when using Iris v2.4?
dask_issue.py code referenced above:
#!/usr/bin/env python
import iris
def main():
print('Loading the data:')
cube = load_cube()
print_cube_info(cube)
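(The script is truncated in this copy - load_cube(), print_cube_info() and the entry point are missing. Below is a minimal reconstruction consistent with the printed output above; the file path is a placeholder, and the extraction and realisation steps reported in the output would follow the same print pattern.)

def load_cube():
    # Placeholder path - the real file is not named in the issue.
    return iris.load_cube('/path/to/ensemble-data.nc')

def print_cube_info(cube):
    print(f'Has lazy data: {cube.has_lazy_data()}')
    if cube.has_lazy_data():
        # core_data() is the dask array while the cube is still lazy.
        print(f'Chunksize: {cube.core_data().chunksize}')
    else:
        # Once realised, report the plain array shape, as in the output above.
        print(f'Chunksize: {cube.data.shape}')

if __name__ == '__main__':
    main()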