Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN-sized chunks #2801

Open
crusaderky opened this issue Mar 5, 2019 · 2 comments
Open

NaN-sized chunks #2801

crusaderky opened this issue Mar 5, 2019 · 2 comments
Labels
topic-arrays related to flexible array support

Comments

@crusaderky
Copy link
Contributor

It would be nice to have support for NaN-sized dask chunks, e.g. x[x > 2].
There are two problems:

  1. x[x > 2] silently resolves the dask graph. It definitely shouldn't. There needs to be some discussion on what needs to happen to indices on the NaN-sized dimension; I can think of 3 options:
  • silently drop any index that would become undefined
  • drop any index that would become undefined and issue a warning
  • hard crash if there is any index that would become undefined
  • redesign IndexVariable so that it can contain dask data (probably much more complicated than the 3 above).
    The above design decision is anyway for when there is an index; dims without indices should just work.
  1. This crashes:
>>> xarray.DataArray(a.data[a.data > 2]).compute()

ValueError: replacement data must match the Variable's shape

I didn't investigate but I suspect it should be trivial to fix. I'm not sure why there is a check at all? Any such health check should be in dask only IMHO.

@fujiisoup
Copy link
Member

  1. x[x > 2] silently resolves the dask graph.

Currently, xarray's indexing does not work for dask-indexers.
At least, there is a TODO

def _nonzero(self):
""" Equivalent numpy's nonzero but returns a tuple of Varibles. """
# TODO we should replace dask's native nonzero
# after https://github.com/dask/dask/issues/1076 is implemented.
nonzeros = np.nonzero(self.data)
return tuple(Variable((dim), nz) for nz, dim

but I am not sure if it is the only one thing to prevent this.

  1. This crashes:

xarray.DataArray(a.data[a.data > 2]).compute()

I am not sure whether it is our issue or that in upstream.

a.data[(a < 2).data]

does not clash, indicating dask can not take xr.DataArray as an index (note, a < 2 is an xr.DataArray).

@shoyer
Copy link
Member

shoyer commented Mar 5, 2019

  1. x[x > 2] silently resolves the dask graph. It definitely shouldn't.

Yes, agreed.

There needs to be some discussion on what needs to happen to indices on the NaN-sized dimension

The simple answer is to not allow undefined indices on NaN sized dimensions for now ("hard crash"). I don't think silently dropping indexes is a good idea.

Long term (after #1603 is complete), I can imagine supporting some sort of "lazy index" class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support
Projects
None yet
Development

No branches or pull requests

4 participants