Lazy indexing arrays as a stand-alone package #5081

Open
shoyer opened this issue Mar 27, 2021 · 7 comments

@shoyer
Member

shoyer commented Mar 27, 2021

From @rabernat on Twitter:

"Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they should be broken out into a standalone package. https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516"

The idea here is to create a first-class "duck array" library for lazy indexing that could replace xarray's internal classes for lazy indexing. This would be in some ways similar to dask.array, but much simpler, because it doesn't have to worry about parallel computing.

Desired features:

A common feature of these operations is that they can (and almost always should) be fused with indexing: if N elements are selected via indexing, only O(N) compute and memory are required to produce them, regardless of the size of the original arrays, as long as the number of applied operations can be treated as a constant. Memory access is significantly slower than compute on modern hardware, so recomputing these operations on the fly is almost always a good idea.
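To make the fusion argument concrete, here is a minimal pure-Python sketch. It is not xarray's implementation: `LazyElementwise` and `CountingSource` are hypothetical names invented for illustration, and the sketch only handles 1-D data and slice keys. Slices are composed on a `range` without touching the data, elementwise functions are only recorded, and the source is read once per selected element when the result is materialized.

```python
class LazyElementwise:
    """Hypothetical 1-D lazy wrapper: slices and elementwise functions are
    recorded, and nothing is read from `source` until `to_list()` is called."""

    def __init__(self, source, indices=None, funcs=()):
        self._source = source
        # A range describes which source positions are selected. Slicing a
        # range yields another range, so composing slices needs no data access.
        self._indices = range(len(source)) if indices is None else indices
        self._funcs = tuple(funcs)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slice keys")
        return LazyElementwise(self._source, self._indices[key], self._funcs)

    def map(self, func):
        # Record the function; it is applied per element at materialization.
        return LazyElementwise(self._source, self._indices, self._funcs + (func,))

    def to_list(self):
        out = []
        for i in self._indices:
            value = self._source[i]  # one read per selected element
            for func in self._funcs:
                value = func(value)
            out.append(value)
        return out


class CountingSource:
    """Stand-in for expensive storage; counts how many elements are read."""

    def __init__(self, n):
        self.n = n
        self.reads = 0

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        self.reads += 1
        return i


source = CountingSource(1_000_000)
lazy = LazyElementwise(source).map(lambda x: x * 0.5).map(lambda x: x + 10.0)
subset = lazy[1000:2000][::100]  # still lazy: no reads so far
values = subset.to_list()        # reads exactly len(subset) elements
```

In this sketch, `source.reads` equals `len(subset)` after materialization, independent of the million-element source, which is the O(N) behaviour described above.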

Out of scope: lazy computation when indexing could require access to many more elements to compute the desired value than are returned. For example, mean() probably should not be lazy, because that could involve computation of a very large number of elements that one might want to cache.

This is valuable functionality for Xarray for two reasons:

  1. It allows for "previewing" small bits of data loaded from disk or remote storage, even if that data needs some form of cheap "decoding" from its form on disk.
  2. It allows for xarray to decode data in a lazy fashion that is compatible with full-featured systems for lazy computation (e.g., Dask), without requiring the user to choose dask when reading the data.
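Both points can be sketched with a small pure-Python example. The assumptions are mine, not from the issue: the "cheap decoding" is taken to be a scale/offset unpacking with a fill value, `LazilyDecoded` and `iter_chunks` are hypothetical names, and `iter_chunks` is only a stand-in for a chunked consumer, not Dask itself.

```python
import math


class LazilyDecoded:
    """Hypothetical wrapper: decodes packed integers only when indexed."""

    def __init__(self, raw, scale_factor, add_offset, fill_value):
        self._raw = raw
        self._scale = scale_factor
        self._offset = add_offset
        self._fill = fill_value

    def __len__(self):
        return len(self._raw)

    def _decode(self, packed):
        # Fill values become NaN; everything else is unpacked linearly.
        if packed == self._fill:
            return math.nan
        return packed * self._scale + self._offset

    def __getitem__(self, key):
        selected = self._raw[key]  # only the requested elements are read
        if isinstance(key, slice):
            return [self._decode(v) for v in selected]
        return self._decode(selected)


def iter_chunks(array, chunk_size):
    """Stand-in for a chunked consumer: pulls one decoded block at a time."""
    for start in range(0, len(array), chunk_size):
        yield array[start:start + chunk_size]


raw = [100, 105, -999, 120, 98, 101]  # made-up packed values
decoded = LazilyDecoded(raw, scale_factor=0.5, add_offset=200.0, fill_value=-999)

preview = decoded[:2]                              # point 1: cheap preview
chunks = list(iter_chunks(decoded, chunk_size=4))  # point 2: block-wise hand-off
```

The wrapper itself never chooses an execution engine. Any consumer that indexes it block by block gets decoded values, which is the sense in which lazy decoding stays compatible with a system like Dask without requiring it.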

Related issues:

@hmaarrfk
Contributor

hmaarrfk commented Feb 1, 2023

I'm going to say, the LazilyIndexedArray is pretty cool.

@hmaarrfk
Contributor

hmaarrfk commented Feb 1, 2023

As a followup question, is the LazilyIndexedArray part of the 'public api'? That is, when you do decide to refactor,
https://docs.xarray.dev/en/stable/generated/xarray.core.indexing.LazilyIndexedArray.html

Will you try to warn us users that choose to

from xarray.core.indexing import LazilyIndexedArray

@shoyer
Member Author

shoyer commented Feb 2, 2023

Is LazilyIndexedArray really a public API? I don't see it on the API docs page.

Personally I would not want to guarantee external stability/availability for this API in its current state.

@Illviljan
Contributor

It is recommended to use it for lazy backends though: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html#how-to-support-lazy-loading

@TomNicholas
Member

FYI there is interesting work going on in dask-land on "dask-expressions" https://github.com/dask-contrib/dask-expr, which is an experiment in doing high-level "query optimization". Computations get represented as a tree of expressions, which will be optimised in various ways before execution.

It doesn't yet support arrays, only dataframes, but a similar effort for arrays would potentially be a generalization of the lazy-array package idea.
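As a rough illustration of how an expression-tree approach could subsume lazy indexing, the sketch below represents a computation as a small tree and applies one rewrite: pushing a slice below an elementwise operation so that only the selected elements are computed. This is not dask-expr's API. The node classes, `optimize`, and `evaluate` are all hypothetical, and the data is 1-D for simplicity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    data: tuple


@dataclass
class Elementwise:
    func: Callable
    child: object


@dataclass
class Slice:
    key: slice
    child: object


def optimize(expr):
    """Rewrite Slice(Elementwise(f, x)) into Elementwise(f, Slice(x))."""
    if isinstance(expr, Slice):
        child = optimize(expr.child)
        if isinstance(child, Elementwise):
            # Elementwise ops commute with indexing, so slice first.
            return Elementwise(child.func, optimize(Slice(expr.key, child.child)))
        return Slice(expr.key, child)
    if isinstance(expr, Elementwise):
        return Elementwise(expr.func, optimize(expr.child))
    return expr


def evaluate(expr):
    """Naive bottom-up evaluation of the tree into a list."""
    if isinstance(expr, Source):
        return list(expr.data)
    if isinstance(expr, Slice):
        return evaluate(expr.child)[expr.key]
    if isinstance(expr, Elementwise):
        return [expr.func(v) for v in evaluate(expr.child)]
    raise TypeError(f"unknown expression node: {expr!r}")


def double(x):
    return 2 * x


expr = Slice(slice(0, 3), Elementwise(double, Source(tuple(range(10)))))
optimized = optimize(expr)  # Elementwise(double, Slice(slice(0, 3), Source(...)))
```

Evaluating `expr` applies `double` to all ten elements and then slices. Evaluating `optimized` gives the same result but applies `double` to only three, which is the lazy-indexing behaviour expressed as a tree rewrite.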

cc @phofl who I understand is working on dask expressions for dataframes, with whom I chatted about this briefly at SciPy.

@d-v-b

d-v-b commented Dec 15, 2023

we are discussing something related over in zarrland: zarr-developers/zarr-python#1603

@TomNicholas
Member

There is another discussion about a lazy arrays package as an implementation of the array API standard in data-apis/array-api#777
