Convert xarray dataset to dask dataframe or delayed objects #1093

Open
atmalagon opened this issue Nov 8, 2016 · 6 comments
Comments

@atmalagon

It would be great to have a function like dask's to_delayed that takes an xarray Dataset and converts it to pandas DataFrames chunk by chunk, without loading everything into memory.

http://stackoverflow.com/questions/40475884/how-to-convert-an-xarray-dataset-to-pandas-dataframes-inside-a-dask-dataframe

@shoyer
Member

shoyer commented Nov 8, 2016

CC @mrocklin @jcrist

This is a good use case for dask collection duck typing: dask/dask#1068

@shoyer changed the title from "convert xarray dataset to pandas dataframe without loading everything into memory" to "Convert xarray dataset to dask dataframe or delayed objects" on Nov 8, 2016
@jcrist

jcrist commented Nov 8, 2016

I'm not sure if I follow how this is a duck typing use case. I'd write this as a method, following your suggestion on SO:

> Toward this end, it would be nice if xarray had something like dask.array's to_delayed method for converting a Dataset into an array of delayed datasets, which you could then lazily convert into DataFrame objects and do your computation.

Can you explain why you think this could benefit from collection duck typing?

@shoyer
Member

shoyer commented Nov 8, 2016

> Can you explain why you think this could benefit from collection duck typing?

Then we could use xarray's normal indexing operations to create new sub-datasets, wrap them with dask.delayed, and start chaining delayed method calls such as to_dataframe. The duck typing is necessary so that dask.delayed knows how to pull the dask graph out of the input Dataset.
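For illustration only, here is a minimal sketch of that chaining on a small in-memory Dataset. The variable names, the toy data, and the manual split into blocks of three along "time" are all assumptions made for the example; with a numpy-backed Dataset there is no dask graph to extract, so this shows only the wrap-and-chain pattern, not the duck typing itself.

```python
import dask
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset (hypothetical data): one variable on a 6 x 2 grid.
ds = xr.Dataset(
    {"temperature": (("time", "x"), np.arange(12.0).reshape(6, 2))},
    coords={"time": np.arange(6), "x": [10, 20]},
)

# Split by hand into sub-datasets of 3 time steps each, using normal indexing.
pieces = [
    ds.isel(time=slice(start, start + 3))
    for start in range(0, ds.sizes["time"], 3)
]

# Wrap each sub-dataset with dask.delayed and chain delayed method calls.
# Nothing is converted to a DataFrame until compute time.
delayed_frames = [
    dask.delayed(piece).to_dataframe().reset_index() for piece in pieces
]

# Evaluate the delayed frames and stitch them together with pandas.
frames = dask.compute(*delayed_frames)
result = pd.concat(frames, ignore_index=True)
```

Instead of computing eagerly, the list of delayed frames could presumably be handed to dask.dataframe.from_delayed to get a lazy dask DataFrame.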

@shoyer
Member

shoyer commented Nov 8, 2016

The other component that would help with this is a utility function inside xarray to split a Dataset (or DataArray) into sub-datasets, one per chunk. Something like:

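import itertools

def split_by_chunks(dataset):
    """Yield (selection, sub-dataset) pairs, one per dask chunk of `dataset`."""
    # Map each chunked dimension to the list of slices covering its chunks.
    chunk_slices = {}
    for dim, chunks in dataset.chunks.items():
        slices = []
        start = 0
        for chunk in chunks:
            stop = start + chunk
            slices.append(slice(start, stop))
            start = stop
        chunk_slices[dim] = slices
    # Take the Cartesian product of the per-dimension slices to visit every block.
    for slices in itertools.product(*chunk_slices.values()):
        selection = dict(zip(chunk_slices.keys(), slices))
        # Indexing a Dataset with a dict of slices selects by integer position.
        yield (selection, dataset[selection])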
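
To see what the slicing loop above produces without needing an xarray object, the same logic can be run on a plain mapping from dimension names to chunk sizes. This is only a sketch: chunk_selections and example_chunks are hypothetical names, and the chunk sizes are made up.

```python
import itertools


def chunk_selections(chunks):
    """Yield one {dim: slice} dict per block, given {dim: (size, size, ...)}."""
    chunk_slices = {}
    for dim, sizes in chunks.items():
        slices = []
        start = 0
        for size in sizes:
            # Each chunk covers the half-open range [start, start + size).
            stop = start + size
            slices.append(slice(start, stop))
            start = stop
        chunk_slices[dim] = slices
    # One selection per combination of per-dimension slices.
    for combo in itertools.product(*chunk_slices.values()):
        yield dict(zip(chunk_slices.keys(), combo))


# Made-up layout: 8 time steps in chunks of (3, 3, 2), 10 x values in chunks of (5, 5).
example_chunks = {"time": (3, 3, 2), "x": (5, 5)}
selections = list(chunk_selections(example_chunks))
```

With three chunks along "time" and two along "x", this yields six selections, and the last dimension varies fastest because of how itertools.product orders its output.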

@dcherian
Contributor

dcherian commented Jul 9, 2019

I think this was closed by mistake. Is there a way to split up Dataset chunks into dask delayed objects where each object is a Dataset?

@stale

stale bot commented Jun 9, 2021

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
