Refactor xr.save_mfdataset() to automatically save an xarray object backed by dask arrays to multiple files #4527

Open
andersy005 opened this issue Oct 20, 2020 · 2 comments

Comments

@andersy005
Member

Is your feature request related to a problem? Please describe.

Currently, when a user wants to write multiple netCDF files in parallel with xarray and dask, they can use the xr.save_mfdataset() function. The function works fine in its current state, but the existing API requires that:

  • the user generates file paths themselves
  • the user maps each chunk or dataset to a corresponding output file
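
For reference, the manual workflow described above looks roughly like the following. This is only a sketch following the groupby pattern commonly used with xr.save_mfdataset(); the helper names yearly_paths and save_by_year, the one-file-per-year split, and the assumption that the dataset has a datetime "time" coordinate are all illustrative, not part of xarray.

```python
from pathlib import Path


def yearly_paths(years, directory):
    """Build one output path per year (hypothetical naming scheme)."""
    return [str(Path(directory) / f"{year}.nc") for year in years]


def save_by_year(ds, directory):
    """Split `ds` by year and write each piece to its own netCDF file.

    Assumes `ds` is an xarray.Dataset with a datetime-like "time" coordinate.
    """
    # Imported lazily so that yearly_paths() can be used without xarray.
    import xarray as xr

    # Step 1: the user splits the dataset into pieces themselves.
    years, datasets = zip(*ds.groupby("time.year"))
    # Step 2: the user generates a matching path for every piece.
    paths = yearly_paths(years, directory)
    # Only then can save_mfdataset() write the pieces.
    xr.save_mfdataset(datasets, paths)
    return paths
```

Both the split and the paths have to be produced by hand before save_mfdataset() is called, which is the inconvenience this issue is about.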

A few months ago, I wrote a blog post showing how to save an xarray dataset backed by dask into multiple netCDF files, and since then I've been meaning to request a new feature to make this process convenient for users.

Describe the solution you'd like

Would it be useful to refactor the existing xr.save_mfdataset() so that it automatically saves an xarray object backed by dask arrays to multiple files, without users having to create the paths themselves? Today, this can be achieved via xr.map_blocks. In other words, would it be possible to have something analogous to to_zarr(....) but for netCDF:

ds.save_mfdataset(prefix="directory/my-dataset")

# or

xr.save_mfdataset(ds, prefix="directory/my-dataset")

---->

directory/my-dataset-chunk-1.nc
directory/my-dataset-chunk-2.nc
directory/my-dataset-chunk-3.nc
....
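
The file names in the listing above could be derived from the prefix along these lines. chunk_paths is a hypothetical helper, not an existing xarray function; the "-chunk-N.nc" suffix and the 1-based numbering are simply taken from the example listing.

```python
def chunk_paths(prefix, n_chunks, start=1):
    """Return one netCDF path per chunk, numbered from `start`.

    The "<prefix>-chunk-<n>.nc" pattern is an assumption based on the
    example listing, not an established xarray convention.
    """
    return [f"{prefix}-chunk-{n}.nc" for n in range(start, start + n_chunks)]


paths = chunk_paths("directory/my-dataset", 3)
```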
@shoyer
Member

shoyer commented Oct 22, 2020

I agree, this does sound useful!

It might make sense to split this into a few pieces of functionality:

  1. A new helper function that splits an xarray object into separate objects for each chunk, including some representation of the "chunk id". Perhaps split_chunks()?
  2. A new higher-level function that combines (1) with the existing save_mfdataset to automatically save an xarray object into multiple files. This should probably be a new function rather than a change to the existing save_mfdataset, because the API is different.
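
A rough sketch of how the two pieces might fit together is below. None of these functions exist in xarray: chunk_slices, split_chunks and save_chunks are hypothetical names, the "chunk id" is assumed to be a tuple of block indices (one per chunked dimension), and the output naming is an assumption. The sketch relies only on Dataset.chunks (a mapping from dimension name to a tuple of chunk sizes), Dataset.isel and the existing xr.save_mfdataset.

```python
import itertools


def chunk_slices(chunks):
    """Yield (chunk_id, {dim: slice}) for every block.

    `chunks` maps dimension name -> tuple of chunk sizes, in the same
    shape as Dataset.chunks. `chunk_id` is a tuple of block indices.
    """
    dims = list(chunks)
    per_dim = []
    for dim in dims:
        # Cumulative sizes give the start/stop offsets of each block.
        bounds = list(itertools.accumulate(chunks[dim], initial=0))
        per_dim.append([slice(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])])
    for chunk_id in itertools.product(*(range(len(s)) for s in per_dim)):
        selection = {dim: s[i] for dim, s, i in zip(dims, per_dim, chunk_id)}
        yield chunk_id, selection


def split_chunks(ds):
    """Piece (1): split a dask-backed Dataset into one object per chunk."""
    for chunk_id, selection in chunk_slices(ds.chunks):
        yield chunk_id, ds.isel(selection)


def save_chunks(ds, prefix):
    """Piece (2): combine split_chunks() with the existing save_mfdataset()."""
    # Imported lazily so that chunk_slices() can be used without xarray.
    import xarray as xr

    ids, pieces = zip(*split_chunks(ds))
    # Flat 1-based numbering is an assumption; the chunk ids could be
    # encoded in the file names instead.
    paths = [f"{prefix}-chunk-{n}.nc" for n, _ in enumerate(ids, start=1)]
    xr.save_mfdataset(pieces, paths)
    return paths
```

Keeping the slicing logic separate from the write step means the splitting helper stays usable on its own, which is the point of breaking the feature into two pieces.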

@dcherian
Contributor

Perhaps split_chunks()?

There was a proposal for .blocks (#3147 (comment)). I agree that a 'chunk id' would be useful. Does dask expose that somehow?
