Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add template xarray object kwarg to map_blocks #3816

Merged
merged 29 commits into from
May 6, 2020

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Mar 2, 2020

This PR adds a template kwarg to map_blocks so that we can do more complicated things where automated inference of the template fails. template is expected to be an xarray object that looks like the result of the map_blocks computation.

@jhamman To me, this seems a lot easier than defining a dict based schema. With dask variables, the memory cost shouldn't be high. It's easy to use standard xarray operations to make something that looks like the result dataset. Here's a notebook prototyping a to_schema/from_schema approach: https://gist.github.com/dcherian/130ba22d0fbadb616837deb914eaa67e#file-map_blocks_for_metsim_test-ipynb

ds = xr.tutorial.load_dataset("air_temperature").chunk({"lat": 5})
template = ds.isel(lat=slice(1, None, 5))
result = xr.map_blocks(lambda x: x.isel(lat=[1]), ds) # fails
result = xr.map_blocks(lambda x: x.isel(lat=[1]), ds, template=template) # yay
xr.testing.assert_equal(result, template)  # passes

Todo:

  • add tests
  • what if the function returns a variable not in the template?
  • should we do anything with template.attrs?

@dcherian dcherian changed the title [WIP] Provide template xarray object to map_blocks [WIP] Add template xarray object kwarg to map_blocks Mar 3, 2020
@pep8speaks
Copy link

pep8speaks commented Mar 6, 2020

Hello @dcherian! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-05 18:13:17 UTC

@dcherian dcherian requested a review from crusaderky March 9, 2020 09:40
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dcherian! This is looking great. I ran this on my test problem and it worked right out of the box. The main TODOs I see are:

  • port to DataArray and Dataset methods
  • add a few more tests (mostly around checking expected outputs with and without template)
  • add an example to the docs
  • get some additional eyes on this (probably @shoyer and/or @crusaderky)

xarray/core/parallel.py Show resolved Hide resolved
xarray/tests/test_dask.py Outdated Show resolved Hide resolved
doc/dask.rst Show resolved Hide resolved
doc/dask.rst Outdated Show resolved Hide resolved
* upstream/master: (54 commits)
  Limit repr of arrays containing long strings (pydata#3900)
  expose a few zarr backend functions as semi-public api (pydata#3897)
  Use drawstyle instead of linestyle in plot.step. (pydata#3274)
  Implementation of polyfit and polyval (pydata#3733)
  misplaced quote in whatsnew (pydata#3889)
  Rename ordered_dict_intersection -> compat_dict_intersection (pydata#3887)
  Control attrs of result in `merge()`, `concat()`, `combine_by_coords()` and `combine_nested()` (pydata#3877)
  xfail test_uamiv_format_write (pydata#3885)
  Use `fixes` in PR template (pydata#3886)
  Tweaks to "how_to_release" (pydata#3882)
  whatsnew section for 0.16.0
  Release v0.15.1
  whatsnew for 0.15.1 (pydata#3879)
  update panel documentation (pydata#3880)
  reword the whats-new entry for unit support (pydata#3878)
  Raise error when assigning to IndexVariable.values & IndexVariable.data (pydata#3862)
  Re-enable tests xfailed in pydata#3808 and fix new CFTimeIndex failures due to upstream changes (pydata#3874)
  add spacing in the versions section of the issue report (pydata#3876)
  map_blocks: allow user function to add new unindexed dimension. (pydata#3817)
  Delete associated indexes when deleting coordinate variables. (pydata#3840)
  ...
…ap-blocks-schema

* 'map-blocks-schema' of github.com:dcherian/xarray:
  Update doc/dask.rst
doc/dask.rst Outdated Show resolved Hide resolved
xarray/core/parallel.py Show resolved Hide resolved
xarray/tests/test_dask.py Show resolved Hide resolved
xarray/core/parallel.py Show resolved Hide resolved
the function will be first run on mocked-up data, that looks like 'obj' but
has sizes 0, to determine properties of the returned object such as dtype,
variable names, new dimensions and new indexes (if any).
'template' must be provided if the function changes the size of existing dimensions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume attrs are also copied from the template and ignored from the computed chunks?

Copy link
Contributor Author

@dcherian dcherian May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But I don't know that this is right.

With auto-inferred templates, attrs are set by the user function. I think it would be nice to preserve that behaviour. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this some more... there's no way to update attrs after computation is there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, attrs needs to be set when the resulting Dataset is created, before we do the the computation.

@shoyer
Copy link
Member

shoyer commented Apr 30, 2020

Looks really nice! I like this design, just a few minor concerns.

* upstream/master: (39 commits)
  Pint support for DataArray (pydata#3643)
  Apply blackdoc to the documentation (pydata#4012)
  ensure Variable._repr_html_ works (pydata#3973)
  Fix handling of abbreviated units like msec (pydata#3998)
  full_like: error on non-scalar fill_value (pydata#3979)
  Fix some code quality and bug-risk issues (pydata#3999)
  DOC: add pandas.DataFrame.to_xarray (pydata#3994)
  Better chunking error messages for zarr backend (pydata#3983)
  Silence sphinx warnings (pydata#3990)
  Fix distributed tests on upstream-dev (pydata#3989)
  Add multi-dimensional extrapolation example and mention different behavior of kwargs in interp (pydata#3956)
  keep attrs in interpolate_na (pydata#3970)
  actually use preformatted text in the details summary (pydata#3978)
  facetgrid: Ensure that colormap params are only determined once. (pydata#3915)
  RasterioDeprecationWarning (pydata#3964)
  Empty line missing for DataArray.assign_coords doc (pydata#3963)
  New coords to existing dim (doc) (pydata#3958)
  implement a more threadsafe call to colorbar (pydata#3944)
  Fix wrong order of coordinate converted from pd.series with MultiIndex (pydata#3953)
  Updated list of core developers (pydata#3943)
  ...
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite nice.

Question on the name template. I think in dask.dataframe and dask.array we might call this meta. Is that keyword already used elsewhere in xarray? template is also a fine name though.

doc/dask.rst Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor Author

Thanks for the review @TomAugspurger

Question on the name template. I think in dask.dataframe and dask.array we might call this meta. Is that keyword already used elsewhere in xarray? template is also a fine name though.

I added the meta kwarg to apply_ufunc so that users could pass that down to dask i.e. that meta = dask's meta = np.ndarray or something like that. So I'd like to avoid reusing meta here where it would exclusively be an xarray object ≠ dask's meta

BUT it seems to me like there's a better name than template. Any ideas?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 30, 2020 via email

@shoyer
Copy link
Member

shoyer commented Apr 30, 2020

I think it's also a good idea to use a different name from dask's meta because template is used differently. It's an object with the full shape of the desired output, not size 0 dimensions.

Copy link
Contributor Author

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made all requested changes. The only thing remaining is attrs.

EDIT: docs now mention that attrs is copied over from template.

xarray/core/parallel.py Show resolved Hide resolved
xarray/core/parallel.py Show resolved Hide resolved
xarray/core/parallel.py Show resolved Hide resolved
xarray/tests/test_dask.py Show resolved Hide resolved
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

(Though we should probably add that warning about attrs first)

the function will be first run on mocked-up data, that looks like 'obj' but
has sizes 0, to determine properties of the returned object such as dtype,
variable names, new dimensions and new indexes (if any).
'template' must be provided if the function changes the size of existing dimensions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, attrs needs to be set when the resulting Dataset is created, before we do the the computation.

@bradyrx
Copy link
Contributor

bradyrx commented May 5, 2020

I missed this originally @dcherian, but thanks for the great work here. The docs changes are a great help.

@dcherian dcherian mentioned this pull request May 5, 2020
23 tasks
@dcherian
Copy link
Contributor Author

dcherian commented May 5, 2020

Added docs on attrs issue. This should be good to go.

Thanks for the kind words, @bradyrx

@dcherian
Copy link
Contributor Author

dcherian commented May 6, 2020

Thanks for the reviews everyone.

@dcherian dcherian merged commit 0b6e22f into pydata:master May 6, 2020
@dcherian dcherian deleted the map-blocks-schema branch May 6, 2020 16:41
dcherian added a commit to kmuehlbauer/xarray that referenced this pull request May 9, 2020
…k-issues

* upstream/master: (22 commits)
  support darkmode (pydata#4036)
  Use literal syntax instead of function calls to create the data structure (pydata#4038)
  Add template xarray object kwarg to map_blocks (pydata#3816)
  Transpose coords by default (pydata#3824)
  Remove broken test for Panel with to_pandas() (pydata#4028)
  Allow warning with cartopy in docs plotting build (pydata#4032)
  Support overriding existing variables in to_zarr() without appending (pydata#4029)
  chore: Remove unnecessary comprehension (pydata#4026)
  fix to_netcdf docstring typo (pydata#4021)
  Pint support for DataArray (pydata#3643)
  Apply blackdoc to the documentation (pydata#4012)
  ensure Variable._repr_html_ works (pydata#3973)
  Fix handling of abbreviated units like msec (pydata#3998)
  full_like: error on non-scalar fill_value (pydata#3979)
  Fix some code quality and bug-risk issues (pydata#3999)
  DOC: add pandas.DataFrame.to_xarray (pydata#3994)
  Better chunking error messages for zarr backend (pydata#3983)
  Silence sphinx warnings (pydata#3990)
  Fix distributed tests on upstream-dev (pydata#3989)
  Add multi-dimensional extrapolation example and mention different behavior of kwargs in interp (pydata#3956)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

map_blocks output inference problems
7 participants