Using Flox with cubed #224
It's implemented in both. We should be able to "just" change …

def identity(x):
    return x

With this, …
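Presumably "implemented in both" refers to `blockwise` being available in both dask and cubed, which is what the original question asks about. As a purely hypothetical sketch (not flox's actual code, and the cubed import path is an assumption), dispatching on the array type could look like:

```python
# Hypothetical sketch: pick a blockwise implementation based on which
# chunked-array library produced the input. Not flox or cubed API.
def get_blockwise(arr):
    """Return a blockwise function matching the array's library."""
    module = type(arr).__module__
    if module.startswith("dask"):
        import dask.array as da
        return da.blockwise
    if module.startswith("cubed"):
        from cubed.core.ops import blockwise  # import path assumed
        return blockwise
    raise TypeError(f"No blockwise implementation known for {type(arr)}")
```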
So I tried to hack flox into accepting … For …
Could delete it. The default in …
Is it? Looks like it's … Commenting out the concatenate … This is for 120 chunks, using

mean = ds.groupby("time.dayofyear").mean(skipna=False, method='map-reduce')  # skipna=False to avoid eager load, see xarray issue #7243

Notice also that because this is the … This doesn't actually execute though (unsurprisingly):

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/tenacity/__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs)
381 try:
--> 382 result = fn(*args, **kwargs)
383 except BaseException: # noqa: B902
File ~/Documents/Work/Code/cubed/cubed/runtime/executors/python.py:10, in exec_stage_func(func, *args, **kwargs)
8 @retry(stop=stop_after_attempt(3))
9 def exec_stage_func(func, *args, **kwargs):
---> 10 return func(*args, **kwargs)
File ~/Documents/Work/Code/cubed/cubed/primitive/blockwise.py:67, in apply_blockwise(out_key, config)
65 args.append(arg)
---> 67 result = config.function(*args)
68 if isinstance(result, dict): # structured array with named fields
File ~/Documents/Work/Code/cubed/cubed/primitive/blockwise.py:259, in fuse.<locals>.fused_func(*args)
258 def fused_func(*args):
--> 259 return pipeline2.config.function(pipeline1.config.function(*args))
TypeError: <lambda>() got an unexpected keyword argument 'axis'

Because of the way cubed retries things, the traceback doesn't give me any more useful context than this, though. Is there a way to resurface the useful part of the traceback @tomwhite?
This might help: https://tenacity.readthedocs.io/en/latest/index.html#error-handling. If so, we should add `reraise=True`.
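For reference, `reraise=True` is tenacity's documented way of surfacing the original exception instead of a wrapped `RetryError`, and the decorator shown in the later traceback applies exactly this change to cubed's `exec_stage_func`:

```python
from tenacity import retry, stop_after_attempt

# With reraise=True, tenacity re-raises the last attempt's original exception
# after retries are exhausted, instead of wrapping it in a RetryError, so the
# real traceback surfaces.
@retry(reraise=True, stop=stop_after_attempt(3))
def exec_stage_func(func, *args, **kwargs):
    return func(*args, **kwargs)
```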
Thanks for the rapid reply! I tried adding that into cubed's python executor, but the traceback still has me none the wiser as to where the problem actually is:

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[17], line 1
----> 1 result['asn'].compute()
File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:1102, in DataArray.compute(self, **kwargs)
1083 """Manually trigger loading of this array's data from disk or a
1084 remote source into memory and return a new array. The original is
1085 left unaltered.
(...)
1099 dask.compute
1100 """
1101 new = self.copy(deep=False)
-> 1102 return new.load(**kwargs)
File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:1076, in DataArray.load(self, **kwargs)
1058 def load(self: T_DataArray, **kwargs) -> T_DataArray:
1059 """Manually trigger loading of this array's data from disk or a
1060 remote source into memory and return this array.
1061
(...)
1074 dask.compute
1075 """
-> 1076 ds = self._to_temp_dataset().load(**kwargs)
1077 new = self._from_temp_dataset(ds)
1078 self._variable = new._variable
File ~/Documents/Work/Code/xarray/xarray/core/dataset.py:792, in Dataset.load(self, **kwargs)
789 chunkmanager = get_chunked_array_type(*lazy_data.values())
791 # evaluate all the chunked arrays simultaneously
--> 792 evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
794 for k, data in zip(lazy_data, evaluated_data):
795 self.variables[k].data = data
File ~/Documents/Work/Code/cubed-xarray/cubed_xarray/cubedmanager.py:69, in CubedManager.compute(self, *data, **kwargs)
66 def compute(self, *data: "CubedArray", **kwargs) -> tuple[np.ndarray, ...]:
67 from cubed import compute
---> 69 return compute(*data, **kwargs)
File ~/Documents/Work/Code/cubed/cubed/core/array.py:410, in compute(executor, callbacks, optimize_graph, resume, *arrays, **kwargs)
407 executor = PythonDagExecutor()
409 _return_in_memory_array = kwargs.pop("_return_in_memory_array", True)
--> 410 plan.execute(
411 executor=executor,
412 callbacks=callbacks,
413 optimize_graph=optimize_graph,
414 resume=resume,
415 array_names=[a.name for a in arrays],
416 **kwargs,
417 )
419 if _return_in_memory_array:
420 return tuple(a._read_stored() for a in arrays)
File ~/Documents/Work/Code/cubed/cubed/core/plan.py:202, in Plan.execute(self, executor, callbacks, optimize_graph, resume, array_names, **kwargs)
197 if callbacks is not None:
198 [
199 callback.on_compute_start(dag, resume=resume)
200 for callback in callbacks
201 ]
--> 202 executor.execute_dag(
203 dag,
204 callbacks=callbacks,
205 array_names=array_names,
206 resume=resume,
207 **kwargs,
208 )
209 if callbacks is not None:
210 [callback.on_compute_end(dag) for callback in callbacks]
File ~/Documents/Work/Code/cubed/cubed/runtime/executors/python.py:22, in PythonDagExecutor.execute_dag(self, dag, callbacks, array_names, resume, **kwargs)
20 if stage.mappable is not None:
21 for m in stage.mappable:
---> 22 exec_stage_func(stage.function, m, config=pipeline.config)
23 if callbacks is not None:
24 event = TaskEndEvent(array_name=name)
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/tenacity/__init__.py:289, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
287 @functools.wraps(f)
288 def wrapped_f(*args: t.Any, **kw: t.Any) -> t.Any:
--> 289 return self(f, *args, **kw)
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/tenacity/__init__.py:379, in Retrying.__call__(self, fn, *args, **kwargs)
377 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
378 while True:
--> 379 do = self.iter(retry_state=retry_state)
380 if isinstance(do, DoAttempt):
381 try:
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/tenacity/__init__.py:325, in BaseRetrying.iter(self, retry_state)
323 retry_exc = self.retry_error_cls(fut)
324 if self.reraise:
--> 325 raise retry_exc.reraise()
326 raise retry_exc from fut.exception()
328 if self.wait:
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/tenacity/__init__.py:158, in RetryError.reraise(self)
156 def reraise(self) -> t.NoReturn:
157 if self.last_attempt.failed:
--> 158 raise self.last_attempt.result()
159 raise self
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/concurrent/futures/_base.py:439, in Future.result(self, timeout)
437 raise CancelledError()
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
441 self._condition.wait(timeout)
443 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
389 if self._exception:
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception
394 self = None
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/tenacity/__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs)
380 if isinstance(do, DoAttempt):
381 try:
--> 382 result = fn(*args, **kwargs)
383 except BaseException: # noqa: B902
384 retry_state.set_exception(sys.exc_info()) # type: ignore[arg-type]
File ~/Documents/Work/Code/cubed/cubed/runtime/executors/python.py:10, in exec_stage_func(func, *args, **kwargs)
8 @retry(reraise=True, stop=stop_after_attempt(3))
9 def exec_stage_func(func, *args, **kwargs):
---> 10 return func(*args, **kwargs)
File ~/Documents/Work/Code/cubed/cubed/primitive/blockwise.py:67, in apply_blockwise(out_key, config)
64 arg = arr[chunk_key]
65 args.append(arg)
---> 67 result = config.function(*args)
68 if isinstance(result, dict): # structured array with named fields
69 for k, v in result.items():
File ~/Documents/Work/Code/cubed/cubed/primitive/blockwise.py:259, in fuse.<locals>.fused_func(*args)
258 def fused_func(*args):
--> 259 return pipeline2.config.function(pipeline1.config.function(*args))
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Your …
It's confusing because the default for … Perhaps we could add a …
I've opened cubed-dev/cubed#226 to fix this.
From what I can tell, Flox only uses the shape of blocks, not the values themselves. So would it be possible to implement this using …?
Maybe, but I don't know that it's worth it. Perhaps with …
Right now it would just be nice to make a groupby work on at least one case to test potential performance.
^ Thanks - changing this leads to some kind of cubed error:

File ~/Documents/Work/Code/cubed/cubed/primitive/blockwise.py:70, in apply_blockwise(out_key, config)
69 for k, v in result.items():
---> 70 config.write.open().set_basic_selection(out_chunk_key, v, fields=k)
71 else:
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/zarr/core.py:1486, in Array.set_basic_selection(self, selection, value, fields)
1485 else:
-> 1486 return self._set_basic_selection_nd(selection, value, fields=fields)
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/zarr/core.py:1790, in Array._set_basic_selection_nd(self, selection, value, fields)
1788 indexer = BasicIndexer(selection, self)
-> 1790 self._set_selection(indexer, value, fields=fields)
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/zarr/core.py:1802, in Array._set_selection(self, indexer, value, fields)
1792 def _set_selection(self, indexer, value, fields=None):
1793
1794 # We iterate over all chunks which overlap the selection and thus contain data
(...)
1800
1801 # check fields are sensible
-> 1802 check_fields(fields, self._dtype)
1803 fields = check_no_multi_fields(fields)
File ~/miniconda3/envs/cubed_xarray/lib/python3.9/site-packages/zarr/indexing.py:854, in check_fields(fields, dtype)
853 if dtype.names is None:
--> 854 raise IndexError("invalid 'fields' argument, array does not have any fields")
855 try:
IndexError: invalid 'fields' argument, array does not have any fields
It's quite hard to debug without the code, but it's worth trying to turn off graph optimization (fuse) by passing `optimize_graph=False`. I can also take a closer look if there's some code you are able to share.
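A minimal sketch of what that looks like from the xarray side, assuming the keyword is forwarded through the chunk manager to `cubed.compute` (whose signature in the traceback above does include `optimize_graph`):

```python
# Disable cubed's graph optimization (operator fusion) so the failure is
# reported from the un-fused blockwise function rather than from fused_func.
# Assumes xarray forwards **kwargs through CubedManager.compute to cubed.compute.
result["asn"].compute(optimize_graph=False)
```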
https://gist.github.com/TomNicholas/c50b89eeb3ec9e8e49368f811689fd65
Thanks for the notebook @TomNicholas. I tried running it, and managed to reproduce the problem.

I think what's happening is that Dask doesn't care about the dtype or shape of intermediate values in the reduction (see e.g. this comment), whereas Cubed does, since it is materializing the intermediates as Zarr arrays, so the dtypes and shapes must be correct. Another problem is that Dask's …

I'm going to have a look at seeing what it would take to get a Flox unit test running with Cubed.
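A toy illustration (plain zarr, names invented, not cubed's code) of why materialized intermediates constrain things: a Zarr array is created with a declared shape and dtype, and block writes that disagree with the declaration fail, whereas dask's in-memory intermediates never hit such a check.

```python
import numpy as np
import zarr

# The intermediate store must know its shape and dtype up front...
intermediate = zarr.zeros(shape=(4, 4), chunks=(2, 2), dtype="f8")

# ...so a block that matches the declaration writes fine,
intermediate[:2, :2] = np.ones((2, 2), dtype="f8")

# ...but a block with a different shape (or incompatible dtype) raises,
# which is roughly the kind of mismatch being discussed here.
try:
    intermediate[:2, :2] = np.ones((3, 3))
except ValueError as err:
    print("zarr rejected the mismatched block:", err)
```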
We would need to migrate to using structured arrays as the intermediates (which dask now supports IIRC), or totally simplify by making the intermediates simple arrays. These would be very major changes; I would just move on with the blog post.
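To make the first option concrete, a hedged sketch (not flox's implementation) of what structured-array intermediates could look like for a grouped mean, so that each block's intermediate is a single array with a fixed, declarable dtype:

```python
import numpy as np

# Pack the partial sum and count for each group into one structured array.
intermediate_dtype = np.dtype([("sum", "f8"), ("count", "i8")])

def chunk_mean(values, labels, ngroups):
    out = np.zeros(ngroups, dtype=intermediate_dtype)
    np.add.at(out["sum"], labels, values)
    np.add.at(out["count"], labels, 1)
    return out

def combine_mean(a, b):
    out = np.empty_like(a)
    out["sum"] = a["sum"] + b["sum"]
    out["count"] = a["count"] + b["count"]
    return out

def finalize_mean(intermediate):
    return intermediate["sum"] / intermediate["count"]

# Toy usage over two "chunks" with group labels 0 and 1.
a = chunk_mean(np.array([1.0, 2.0, 3.0]), np.array([0, 1, 0]), ngroups=2)
b = chunk_mean(np.array([4.0, 5.0]), np.array([1, 1]), ngroups=2)
print(finalize_mean(combine_mean(a, b)))  # [2.0, 3.666...]
```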
I just reached the same conclusion after looking at it for a while!
Sorry! I did have the same realization many months ago but didn't write it down and totally forgot about it!
No problem! It was quite instructive looking through the code. A couple of things I noticed: …
I've been thinking about this some. I think …

For "cohorts", I think we'll want to think about what makes sense for cubed. The current implementation really adapts to dask's preferences at the moment by (practically speaking) choosing to shuffle chunks, instead of subsets of a chunk. Perhaps for cubed a more explicit shuffle makes more sense: calculate the blockwise intermediates, but write them to multiple intermediate zarrs (one per cohort), and then tree-reduce those independently. In any case, the fundamental idea is that some group labels tend to occur together (particularly for quasi-periodic time groupings) …

PS: …
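A runnable toy of that per-cohort idea (all names, shapes, and numbers are illustrative; nothing here is flox or cubed API): compute the per-chunk intermediates, write each cohort's slice of them to its own zarr store, then reduce every cohort independently.

```python
import numpy as np
import zarr

# Toy data: 24 values, 4 group labels, 4 chunks of 6 values each, and two
# cohorts of labels that tend to occur together.
data = np.arange(24, dtype="f8")
labels = np.tile([0, 1, 2, 3], 6)
chunks = np.split(np.arange(24), 4)
cohorts = {"a": [0, 1], "b": [2, 3]}

def chunk_intermediate(idx):
    # Blockwise intermediate: per-group partial sum and count for one chunk.
    sums = np.bincount(labels[idx], weights=data[idx], minlength=4)
    counts = np.bincount(labels[idx], minlength=4).astype("f8")
    return np.stack([sums, counts])  # shape (2, ngroups)

# One intermediate zarr per cohort, holding only that cohort's groups.
stores = {}
for name, members in cohorts.items():
    z = zarr.zeros((len(chunks), 2, len(members)), chunks=(1, 2, len(members)), dtype="f8")
    for i, idx in enumerate(chunks):
        z[i] = chunk_intermediate(idx)[:, members]
    stores[name] = z

# Reduce each cohort independently (a tree-reduce in a real implementation).
for name, members in cohorts.items():
    total = stores[name][:].sum(axis=0)  # (2, len(members))
    print(name, dict(zip(members, total[0] / total[1])))
# a {0: 10.0, 1: 11.0}
# b {2: 12.0, 3: 13.0}
```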
Great! Does Flox need to use the Array API for this to work, or is that not needed?
BTW I'm planning on adding …
Not needed AFAICT. I'm also OK to …

Aside from algorithmic changes, isn't this just …?

PS: It'd be nice if both dask and cubed allowed …
Yes
FYI my original proposal for what is now "cohorts" was quite different. The idea came to me as "shuffling chunks" so that members of a group would be in nearby chunks, so that … https://gist.github.com/dcherian/6ccd76d2a6eaadb7844d61d197a8b3db

If cubed can do the 'block shuffling' without actually rewriting data, it might be a good idea still. The downside is that you may end up with massive chunks at the end of the reduction. For e.g. consider doing …

Just thought I'd bring it up since you are reimplementing these ideas.

EDIT: Line 377 in efd88e1
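A back-of-envelope illustration of the "massive chunks" concern (all numbers invented for illustration, not taken from the thread): if chunk shuffling co-locates every member of a group, a single group's chunk can easily exceed memory.

```python
# groupby("time.month") over 100 years of daily, 0.25-degree global data,
# with all of one month's members shuffled into a single chunk.
ntime_per_group = 100 * 365 // 12   # ~3041 daily steps per calendar month
nlat, nlon = 721, 1440              # 0.25-degree global grid
bytes_per_value = 8                 # float64
chunk_bytes = ntime_per_group * nlat * nlon * bytes_per_value
print(f"{chunk_bytes / 1e9:.1f} GB in one group's chunk")  # ~25.3 GB
```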
What would need to happen to use flox with cubed instead of dask?
(I see this question as part of fully solving pydata/xarray#6807.)
In the code I see `blockwise` being used, but also `dask.array.reductions._tree_reduce` and custom `HighLevelGraph` objects being created. Is there any combination of arguments to flox that only uses `blockwise`? Could more of flox be made to work via the blockwise abstraction? Should we add `_tree_reduce` to our list of useful computation patterns for chunked arrays that includes `blockwise`, `map_blocks`, and `apply_gufunc`?

Sorry if we already discussed this elsewhere.
EDIT: cc @tomwhite
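Regarding adding `_tree_reduce` to that list of patterns, here is a hedged sketch of the generic shape of the pattern (hypothetical signature, not dask's or cubed's API): repeatedly combine neighbouring per-chunk intermediates in rounds until a single result remains, then finalize it.

```python
import numpy as np

def tree_reduce(intermediates, combine, aggregate, split_every=2):
    # Combine neighbouring intermediates in rounds of `split_every`
    # until a single intermediate remains, then finalize it.
    level = list(intermediates)
    while len(level) > 1:
        level = [
            combine(level[i : i + split_every])
            for i in range(0, len(level), split_every)
        ]
    return aggregate(level[0])

# Toy usage: a chunked sum, where each intermediate is a per-chunk partial sum.
partials = [chunk.sum() for chunk in np.array_split(np.arange(100), 7)]
total = tree_reduce(partials, combine=sum, aggregate=lambda x: x)
assert total == 4950
```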