Chunk control #3361
Conversation
I think I've finally got what I want out of this now.
Whoops. ⏰ Hopefully fixed.
Evidence for this fixing the original problem in #3357 ...
"After" examples:
Thanks! I confirm that this fixes issue #3357.
lib/iris/_lazy_data.py
Outdated
@@ -23,12 +23,14 @@
 from __future__ import (absolute_import, division, print_function)
 from six.moves import (filter, input, map, range, zip) # noqa
+
+from collections import Iterable
Currently in Python 3.7, you get the following DeprecationWarning:
>>> from collections import Iterable
__main__:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
Could you adopt the following pattern:
try:  # Python 3
    from collections.abc import Iterable
except ImportError:  # Python 2
    from collections import Iterable
Looks good, will do!
Overall looks good!
Just a few questions...
lib/iris/_lazy_data.py
Outdated
# Fetch the default 'optimal' chunksize from the dask config.
limit = dask.config.get('array.chunk-size')
# Convert to bytes
limit = da.core.parse_bytes(limit)
Why did you choose to get parse_bytes from da.core rather than from dask.utils?
I think I copied this from the dask source code somewhere.
I will fix it.
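Presumably the fix just swaps in the public helper; a minimal sketch, assuming the surrounding lines stay as in the diff above:

import dask.config
from dask.utils import parse_bytes

# Fetch the default 'optimal' chunksize from the dask config, e.g. '128MiB'.
limit = dask.config.get('array.chunk-size')
# Convert the config string to a number of bytes via the public helper.
limit = parse_bytes(limit)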
lib/iris/_lazy_data.py
Outdated
# Create result chunks, starting with a copy of the input.
result = list(chunks)
if shape is None:
When would shape be None? I don't think we should be allowing for shape=None. You also iterate through shape on line 105.
I think I was trying to mimic the API of dask.array.core.normalize_chunks, in case we can use that in the future.
Actually I haven't achieved that, and we never use this option, so I will remove it.
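For reference, a hedged illustration of the dask helper being mimicked: dask.array.core.normalize_chunks does accept an explicit shape together with a byte limit (the values here are made up):

import dask.array as da
import numpy as np

# Resolve 'auto' chunking against a shape, dtype and byte limit.
chunks = da.core.normalize_chunks(
    'auto', shape=(1000, 2000), dtype=np.dtype('f8'), limit='128MiB')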
lib/iris/fileformats/netcdf.py
Outdated
@@ -511,7 +511,7 @@ def _get_cf_var_data(cf_var, filename):
     proxy = NetCDFDataProxy(cf_var.shape, dtype, filename, cf_var.cf_name,
                             fill_value)
     chunks = cf_var.cf_data.chunking()
-    # Chunks can be an iterable, None, or `'contiguous'`.
+    # Chunks can be an iterable, or `'contiguous'`.
I don't understand this change. You have removed None, and yet two lines down it sets chunks = None.
There are two different sets of conventions here.
- The 'chunks' value that nc.Variable.data_chunking returns cannot be None, I believe.
- I don't know quite why it ever said that: it just seems wrong to me.
- The 'chunks' keyword we pass to as_lazy_data can be None. And it absolutely can't be 'contiguous', which is why we are converting here.
I will try to clarify in the comments; a sketch of the conversion follows.
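As I read the PR diff, the conversion in question amounts to this (a sketch using the names from lib/iris/fileformats/netcdf.py):

chunks = cf_var.cf_data.chunking()
# netCDF4 reports 'contiguous' for an unchunked variable; as_lazy_data
# expects None in that case, so translate before passing it on.
if chunks == 'contiguous':
    chunks = None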
lib/iris/_lazy_data.py
Outdated
# Return chunks unchanged, for types of invocation we don't comprehend.
if (any(elem <= 0 for elem in shape) or
        not isinstance(chunks, Iterable) or
        len(chunks) != len(shape)):
You don't have explicit tests for this check.
I'm not sure how thorough we want to be with testing this. It does seem like a bit of overkill to add tests for _optimum_chunksize((200,300), (1,200,300)).
This was an attempt to allow alternative, dask-type chunks arguments in 'as_lazy_data'.
Obviously we don't use anything like that at present. The only immediate need is to skip shapes with a zero in them (see comment).
I now see this is wrong anyway, as that initial test clause assumes that shape is iterable!
I will simplify to just what we need, and add a testcase.
... after some thought, I have moved this check to the caller and documented the behaviour there.
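Something like this caller-side guard, I take it (a rough sketch; the exact _optimum_chunksize signature here is assumed, not quoted from the final diff):

# Only optimise chunks for shapes with no zero-length dimensions;
# otherwise pass the chunks through unchanged.
if all(dim > 0 for dim in data.shape):
    chunks = _optimum_chunksize(chunks, shape=data.shape,
                                dtype=data.dtype)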
Attempted to address review comments. Note: I didn't make any changes to address the comment about the size of divided chunks.
limitcall_patch = self.patch('iris._lazy_data._optimum_chunksize')
test_shape = (2, 1, 0, 2)
data = self._dummydata(test_shape)
result = as_lazy_data(data, chunks=test_shape)
F841 local variable 'result' is assigned to but never used
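The obvious fix for that flake8 warning is to drop the unused binding; presumably the assertion is made on the patched _optimum_chunksize, not the return value:

# Return value unused: the test only inspects the mocked call.
as_lazy_data(data, chunks=test_shape)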
Thanks @lbdreyer, this fixes the outstanding errors. I'm happy that those other uses in testing of as_lazy_data(.. chunks=X ..) are all now unnecessary, and we don't need to support any more complex usages: we accept that this 'chunks' keyword is not much like the dask one, and that is ok; the new docstring reflects this, I hope. Regarding the "lost" test ...
Meanwhile, though ... I'm still wondering about your comment "This does end with an array that is unequally split into chunks". So far, a good practical example ... (given the current default dask chunk size, which is 128 MiB). To be continued ...
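To make the arithmetic concrete (a hedged illustration with made-up numbers, not the original example): with the 128 MiB default, a (1, 1000, 1000) float32 slice occupies 4 MB, so roughly 33 such slices fit in one merged chunk:

import numpy as np

bytes_per_slice = np.prod((1, 1000, 1000)) * np.dtype('f4').itemsize
limit = 128 * 1024 ** 2                      # dask default: 128 MiB
slices_per_chunk = limit // bytes_per_slice  # -> 33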
Interested? @cpelley
Hi @lbdreyer
Hi again @lbdreyer
This is a really great change! 💯 Thanks for persisting with it @pp-mo!
Revised chunking policy, which will mostly affect loading from netcdf files.
Key change: allows multiplying up as well as dividing chunks, mainly to fix #3357.
Closes #3357, #3362