
Chunk control #3361

Merged: 19 commits, Aug 23, 2019

Conversation

pp-mo
Member

@pp-mo pp-mo commented Jul 25, 2019

Revised chunking policy, which will mostly affect loading from netcdf files.
Key change: allows multiplying chunks up as well as dividing them, mainly to fix #3357.

Closes #3357, #3362

@pp-mo
Member Author

pp-mo commented Jul 25, 2019

I think I've finally got what I want out of this now.
Good to go, @bjlittle? 🙏

@pp-mo pp-mo requested a review from bjlittle July 25, 2019 17:35
@pp-mo
Member Author

pp-mo commented Jul 26, 2019

Whoops. ⏰ ⚠️ 💣
Sorry guys, I didn't mean to do that ...

Hopefully fixed.

@pp-mo
Member Author

pp-mo commented Jul 26, 2019

Evidence that this fixes the original problem in #3357, running the sample test code from there.
"Before" timings (release v2.2.1):

Duration 25.450 s
Duration 26.748 s
Duration 27.222 s

"After" examples :

Duration 0.219 s
Duration 0.222 s
Duration 0.232 s

@tkarna

tkarna commented Jul 26, 2019

Thanks! I confirm that this fixes issue #3357.

@bjlittle
Member

@pp-mo This is also applicable to #3362

@@ -23,12 +23,14 @@
from __future__ import (absolute_import, division, print_function)
from six.moves import (filter, input, map, range, zip) # noqa

from collections import Iterable
Member

@pp-mo See #3320 for context.

Currently, in Python 3.7, you get the following DeprecationWarning:

>>> from collections import Iterable
__main__:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working

Could you adopt the following pattern:

try:  # Python 3
    from collections.abc import Iterable
except ImportError:  # Python 2
    from collections import Iterable

Member Author

Looks good, will do!

@lbdreyer lbdreyer self-assigned this Aug 19, 2019
@pp-mo
Member Author

pp-mo commented Aug 20, 2019

@lbdreyer I think this is only failing due to numpy 1.17, so following #3369 a rebase should get a clean pass.
Are you OK with that, or are you still writing review comments against the existing commits?

Member

@lbdreyer lbdreyer left a comment


Overall looks good!

Just a few questions...

# Fetch the default 'optimal' chunksize from the dask config.
limit = dask.config.get('array.chunk-size')
# Convert to bytes
limit = da.core.parse_bytes(limit)
Member

Why did you choose to get parse_bytes from da.core rather than from dask.utils?

Member Author

@pp-mo pp-mo Aug 20, 2019

I think I copied this from the dask source code somewhere.
I will fix it.
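For reference, a minimal sketch of the suggested change (assuming the same dask config lookup as in the diff above; dask.utils is the usual home of parse_bytes):

import dask.array as da   # ensures the 'array' config defaults are registered
import dask.config
from dask.utils import parse_bytes

# Fetch the default 'optimal' chunksize from the dask config (e.g. "128MiB"),
# then convert it to a number of bytes.
limit = parse_bytes(dask.config.get('array.chunk-size'))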


# Create result chunks, starting with a copy of the input.
result = list(chunks)
if shape is None:
Member

When would shape be None? I don't think we should be allowing shape=None. You also iterate through shape on line 105.

Member Author

I think I was trying to mimic the API of dask.array.core.normalize_chunks, in case we can use that in the future.
Actually I haven't achieved that, and we never use this option, so I will remove it.

@@ -511,7 +511,7 @@ def _get_cf_var_data(cf_var, filename):
proxy = NetCDFDataProxy(cf_var.shape, dtype, filename, cf_var.cf_name,
fill_value)
chunks = cf_var.cf_data.chunking()
- # Chunks can be an iterable, None, or `'contiguous'`.
+ # Chunks can be an iterable, or `'contiguous'`.
Member

I don't understand this change. You have removed None and yet two lines down it sets

chunks = None

Member Author

There are two different sets of conventions here.

  • The 'chunks' value that nc.Variable.data_chunking returns cannot be None, I believe.
    • I don't know quite why it ever said that: it just seems wrong to me.
  • The 'chunks' keyword we pass to as_lazy_data can be None, and it absolutely can't be 'contiguous', which is why we are converting here.

I will try to clarify in the comments.
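For clarity, a rough sketch of how _get_cf_var_data handles the two conventions (paraphrased from the diff above; the exact surrounding code, including the final as_lazy_data call, is an assumption):

chunks = cf_var.cf_data.chunking()
# The netCDF4 chunking() call returns either a list of chunk sizes or the
# string 'contiguous', but never None.
if chunks == 'contiguous':
    # as_lazy_data understands None as "no chunking hint", so translate.
    chunks = None
return as_lazy_data(proxy, chunks=chunks)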

lib/iris/_lazy_data.py
# Return chunks unchanged, for types of invocation we don't comprehend.
if (any(elem <= 0 for elem in shape) or
        not isinstance(chunks, Iterable) or
        len(chunks) != len(shape)):
Member

You don't have explicit tests for this check.
I'm not sure how thorough we want to be with testing this. It does seem like a bit of overkill to add tests for
_optimum_chunksize((200,300), (1,200,300))

Member Author

This was an attempt to allow alternative, dask-type chunks arguments in 'as_lazy_data'.
Obviously we don't use anything like that at present. The only immediate need is to skip shapes with a zero in them (see comment).

I now see this is wrong anyway, as that initial test clause assumes that shape is iterable!
I will simplify to just what we need, and add a testcase.

Member Author

... after some thought, I have moved this check out to the caller and documented the behaviour there.
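Roughly, the idea is something like the following (a sketch only, not the exact code in this PR; the function body shown here is an assumption):

import dask.array as da


def as_lazy_data(data, chunks=None, asarray=False):
    if chunks is None:
        # Default to the full data shape as the starting chunk.
        chunks = data.shape
    # Skip the chunk optimisation entirely for zero-length dimensions:
    # there is then no data to divide (or multiply) up.
    if all(dim > 0 for dim in data.shape):
        chunks = _optimum_chunksize(chunks, data.shape)
    return da.from_array(data, chunks=chunks, asarray=asarray)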

lib/iris/_lazy_data.py (outdated)
@pp-mo
Member Author

pp-mo commented Aug 20, 2019

Attempted to address review comments.
Please re-review @lbdreyer .

Note: I didn't make any changes to address the comment about the size of divided chunks.
We can still discuss that if wanted.

limitcall_patch = self.patch('iris._lazy_data._optimum_chunksize')
test_shape = (2, 1, 0, 2)
data = self._dummydata(test_shape)
result = as_lazy_data(data, chunks=test_shape)


F841 local variable 'result' is assigned to but never used
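One minimal way to address the flake8 warning (a hypothetical fix, not necessarily the change actually made) is to use the result in an assertion:

data = self._dummydata(test_shape)
result = as_lazy_data(data, chunks=test_shape)
# Use the result, so flake8 no longer reports F841 for it.
self.assertEqual(result.shape, test_shape)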

@pp-mo
Member Author

pp-mo commented Aug 21, 2019

Thanks @lbdreyer, this fixes the outstanding errors.

I'm happy that those other uses of as_lazy_data(.. chunks=X ..) in the tests are all now unnecessary, and that we don't need to support any more complex usages. We accept that this 'chunks' keyword is not much like the dask one, and that is OK: I hope the new docstring reflects this.

Regarding the "lost" test,
test_as_lazy_data.Test__optimised_chunks.test_large_specific_chunk_passthrough
I think that is now obsolete, because the earlier idea was that a specified chunks= should always pass through unchanged to the dask call, but that is definitely no longer the case.
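As a usage illustration (a hypothetical example, not taken from the Iris tests): the 'chunks' argument is only a starting-point hint, which the chunk optimisation may then enlarge or reduce.

import numpy as np
from iris._lazy_data import as_lazy_data

data = np.zeros((117, 300, 200), dtype=np.float32)
# A chunking hint of single slices over the first dimension ...
lazy = as_lazy_data(data, chunks=(1, 300, 200))
# ... may come back multiplied up into fewer, larger chunks.
print(lazy.chunks)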

@pp-mo
Member Author

pp-mo commented Aug 21, 2019

Meanwhile, though ...

I'm still wondering about your comment "This does end with an array that is unequally split into chunks"
There could still be an improvement to be made here, but I'm not yet clear exactly how it should work.

So far, a good practical example ...

>>> _optimum_chunksize((3, 300, 200), (117, 300, 2000))
(54, 300, 2000)

(given the current default dask chunk size, which is 128 MiB)
this will result in 3 chunks with first-dimension sizes (54, 54, 9),
but you "could have had" (39, 39, 39) instead.
I will investigate how that can be calculated, taking account of what happens in slightly-smaller and slightly-larger cases (i.e. ones that don't divide equally).
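For instance, one possible way of calculating the more balanced split (a sketch of the idea only, not the code eventually merged):

import math

# Given a dimension length and the largest chunk that fits the byte limit,
# pick the smallest number of chunks that stays under the limit, then divide
# the dimension between them as evenly as possible.
def balanced_chunk(dim_length, max_chunk):
    n_chunks = math.ceil(dim_length / max_chunk)
    return math.ceil(dim_length / n_chunks)

# The example above: a maximum of 54 along a dimension of length 117 ...
print(balanced_chunk(117, 54))   # --> 39, i.e. chunks of (39, 39, 39)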

To be continued ...

@pp-mo
Member Author

pp-mo commented Aug 21, 2019

Interested? @cpelley

@pp-mo
Member Author

pp-mo commented Aug 22, 2019

Hi @lbdreyer
Finally, I think I made sense of the point you raised above, regarding a better choice of chunk sizes.
Hope this makes sense; sorry it has proved so complicated to explain and test!
Please re-review ...

@pp-mo
Member Author

pp-mo commented Aug 23, 2019

Hi again @lbdreyer
As per our offline discussion, I have now simplified the testing a bit.

@lbdreyer
Member

This is a really great change! 💯

Thanks for persisting with it @pp-mo!

@lbdreyer lbdreyer merged commit f402a19 into SciTools:master Aug 23, 2019
@pp-mo pp-mo deleted the chunk_control branch October 17, 2019 10:49
Linked issue: Reading netcdf files is slow if there are unlimited dimensions