Add template xarray object kwarg to map_blocks #3816
This accounts for dimension sizes being changed by the applied function.
This reverts commit 045ae2b.
Thanks @dcherian! This is looking great. I ran this on my test problem and it worked right out of the box. The main TODOs I see are:
- port to `DataArray` and `Dataset` methods
- add a few more tests (mostly around checking expected outputs with and without template)
- add an example to the docs
- get some additional eyes on this (probably @shoyer and/or @crusaderky)
* upstream/master: (54 commits) Limit repr of arrays containing long strings (pydata#3900) expose a few zarr backend functions as semi-public api (pydata#3897) Use drawstyle instead of linestyle in plot.step. (pydata#3274) Implementation of polyfit and polyval (pydata#3733) misplaced quote in whatsnew (pydata#3889) Rename ordered_dict_intersection -> compat_dict_intersection (pydata#3887) Control attrs of result in `merge()`, `concat()`, `combine_by_coords()` and `combine_nested()` (pydata#3877) xfail test_uamiv_format_write (pydata#3885) Use `fixes` in PR template (pydata#3886) Tweaks to "how_to_release" (pydata#3882) whatsnew section for 0.16.0 Release v0.15.1 whatsnew for 0.15.1 (pydata#3879) update panel documentation (pydata#3880) reword the whats-new entry for unit support (pydata#3878) Raise error when assigning to IndexVariable.values & IndexVariable.data (pydata#3862) Re-enable tests xfailed in pydata#3808 and fix new CFTimeIndex failures due to upstream changes (pydata#3874) add spacing in the versions section of the issue report (pydata#3876) map_blocks: allow user function to add new unindexed dimension. (pydata#3817) Delete associated indexes when deleting coordinate variables. (pydata#3840) ...
…ap-blocks-schema * 'map-blocks-schema' of github.com:dcherian/xarray: Update doc/dask.rst
the function will be first run on mocked-up data, that looks like 'obj' but
has sizes 0, to determine properties of the returned object such as dtype,
variable names, new dimensions and new indexes (if any).
'template' must be provided if the function changes the size of existing dimensions.
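As a rough illustration of the inference path this docstring describes, here is a minimal sketch, assuming a recent xarray with dask installed. The array `da` and the helper `double` are hypothetical names made up for this example. Because `double` leaves every dimension size unchanged, no `template` is passed and `map_blocks` works out the output's properties on its own.

```python
import numpy as np
import xarray as xr

# Hypothetical input: a dask-backed DataArray split into three chunks along "x".
da = xr.DataArray(
    np.arange(12.0), dims="x", coords={"x": np.arange(12)}, name="a"
).chunk({"x": 4})


def double(block):
    # Leaves all dimension sizes unchanged, so no template is required.
    return block * 2


# No template given: map_blocks infers the result's properties itself.
result = xr.map_blocks(double, da)

# The result is lazy; compute() runs the function on each real chunk.
computed = result.compute()
```

The result stays dask-backed until `compute()` is called, so its metadata is available before any chunk is processed.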
I assume attrs are also copied from the template and ignored from the computed chunks?
Yes. But I don't know that this is right.
With auto-inferred templates, attrs are set by the user function. I think it would be nice to preserve that behaviour. What do you think?
Thinking about this some more... there's no way to update `attrs` after computation, is there?
Correct, `attrs` needs to be set when the resulting Dataset is created, before we do the computation.
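The `attrs` behaviour discussed in this thread can be sketched as below, assuming a recent xarray with dask installed. The function `tag` and the attribute name `source` are hypothetical, chosen only for illustration. The sketch sets one value on the template and a different one inside the user function, then inspects which value the result carries.

```python
import numpy as np
import xarray as xr

# Hypothetical input: a dask-backed DataArray with three chunks along "x".
da = xr.DataArray(
    np.arange(12.0), dims="x", coords={"x": np.arange(12)}
).chunk({"x": 4})


def tag(block):
    # The user function sets its own attrs on each computed chunk.
    out = block + 1
    out.attrs["source"] = "set inside the function"
    return out


# The template has the same shape and chunks as the expected result,
# but carries a different value for the same attribute.
template = da.assign_attrs(source="set on the template")

result = xr.map_blocks(tag, da, template=template)
computed = result.compute()
```

Since the result object is built before any chunk is computed, the attrs written by `tag` have no way to reach it; the attrs on `result` are those of `template`.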
Looks really nice! I like this design, just a few minor concerns.
* upstream/master: (39 commits) Pint support for DataArray (pydata#3643) Apply blackdoc to the documentation (pydata#4012) ensure Variable._repr_html_ works (pydata#3973) Fix handling of abbreviated units like msec (pydata#3998) full_like: error on non-scalar fill_value (pydata#3979) Fix some code quality and bug-risk issues (pydata#3999) DOC: add pandas.DataFrame.to_xarray (pydata#3994) Better chunking error messages for zarr backend (pydata#3983) Silence sphinx warnings (pydata#3990) Fix distributed tests on upstream-dev (pydata#3989) Add multi-dimensional extrapolation example and mention different behavior of kwargs in interp (pydata#3956) keep attrs in interpolate_na (pydata#3970) actually use preformatted text in the details summary (pydata#3978) facetgrid: Ensure that colormap params are only determined once. (pydata#3915) RasterioDeprecationWarning (pydata#3964) Empty line missing for DataArray.assign_coords doc (pydata#3963) New coords to existing dim (doc) (pydata#3958) implement a more threadsafe call to colorbar (pydata#3944) Fix wrong order of coordinate converted from pd.series with MultiIndex (pydata#3953) Updated list of core developers (pydata#3943) ...
This looks quite nice.
Question on the name `template`. I think in dask.dataframe and dask.array we might call this `meta`. Is that keyword already used elsewhere in xarray? `template` is also a fine name though.
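Thanks for the review @TomAugspurger
I added the `meta` kwarg to `apply_ufunc` so that users could pass that down to dask, i.e. that `meta` = dask's `meta` = `np.ndarray` or something like that. So I'd like to avoid reusing `meta` here, where it would exclusively be an xarray object ≠ dask's `meta`.
BUT it seems to me like there's a better name than `template`. Any ideas?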
Makes sense. template seems fine.
…On Thu, Apr 30, 2020 at 3:35 PM Deepak Cherian ***@***.***> wrote:
Thanks for the review @TomAugspurger <https://github.com/TomAugspurger>
Question on the name template. I think in dask.dataframe and dask.array
we might call this meta. Is that keyword already used elsewhere in
xarray? template is also a fine name though.
I added the meta kwarg to apply_ufunc so that users could pass that down
to dask i.e. that meta = dask's meta = np.ndarray or something like that.
So I'd like to avoid reusing meta here where it would exclusively be an
xarray object ≠ dask's meta
BUT it seems to me like there's a better name than template. Any ideas?
I think it's also a good idea to use a different name from dask's `meta`.
Made all requested changes. The only thing remaining is attrs.
EDIT: docs now mention that attrs is copied over from template.
Looks good to me!
(Though we should probably add that warning about `attrs` first)
I missed this originally @dcherian, but thanks for the great work here. The docs changes are a great help.
Added docs on attrs issue. This should be good to go. Thanks for the kind words, @bradyrx
Thanks for the reviews everyone.
…k-issues * upstream/master: (22 commits) support darkmode (pydata#4036) Use literal syntax instead of function calls to create the data structure (pydata#4038) Add template xarray object kwarg to map_blocks (pydata#3816) Transpose coords by default (pydata#3824) Remove broken test for Panel with to_pandas() (pydata#4028) Allow warning with cartopy in docs plotting build (pydata#4032) Support overriding existing variables in to_zarr() without appending (pydata#4029) chore: Remove unnecessary comprehension (pydata#4026) fix to_netcdf docstring typo (pydata#4021) Pint support for DataArray (pydata#3643) Apply blackdoc to the documentation (pydata#4012) ensure Variable._repr_html_ works (pydata#3973) Fix handling of abbreviated units like msec (pydata#3998) full_like: error on non-scalar fill_value (pydata#3979) Fix some code quality and bug-risk issues (pydata#3999) DOC: add pandas.DataFrame.to_xarray (pydata#3994) Better chunking error messages for zarr backend (pydata#3983) Silence sphinx warnings (pydata#3990) Fix distributed tests on upstream-dev (pydata#3989) Add multi-dimensional extrapolation example and mention different behavior of kwargs in interp (pydata#3956) ...
- `isort -rc . && black . && mypy . && flake8`
- `whats-new.rst` for all changes and `api.rst` for new API

This PR adds a `template` kwarg to `map_blocks` so that we can do more complicated things where automated inference of the template fails. `template` is expected to be an xarray object that looks like the result of the `map_blocks` computation.

@jhamman To me, this seems a lot easier than defining a `dict` based schema. With dask variables, the memory cost shouldn't be high. It's easy to use standard xarray operations to make something that looks like the result dataset. Here's a notebook prototyping a `to_schema`/`from_schema` approach: https://gist.github.com/dcherian/130ba22d0fbadb616837deb914eaa67e#file-map_blocks_for_metsim_test-ipynb

Todo:
- `template.attrs`?
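
For readers who want to see the `template` kwarg in use, here is a minimal sketch, assuming a recent xarray with dask installed. The array `da` and the function `second_element` are hypothetical and not taken from the PR. The function shrinks `x` from 4 elements to 1 in every block, which is the case where a template has to be supplied, and the template itself is built with ordinary xarray indexing.

```python
import numpy as np
import xarray as xr

# Hypothetical input: 12 values split into three chunks of 4 along "x".
da = xr.DataArray(
    np.arange(12.0), dims="x", coords={"x": np.arange(12)}
).chunk({"x": 4})


def second_element(block):
    # Changes the size of an existing dimension: 4 -> 1 per block.
    return block.isel(x=[1])


# Build something that "looks like" the result using standard xarray
# operations: one element per input chunk, with one output chunk each.
template = da.isel(x=[1, 5, 9]).chunk({"x": 1})

result = xr.map_blocks(second_element, da, template=template)
computed = result.compute()
```

The template here has as many chunks along `x` as the input, and its `x` index matches what each block returns. It stays dask-backed, so building it should not require loading the data.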