-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Notion of "distance" or "scale" for indexes and selection #5453
Comments
It appears enough has been done in the explicit indexing refactor that this is now possible! Using xarray 2024.6.0: import xarray as xr
import numpy as np
# some example data
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 100]})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
print(ds)
# We create a class which is a subclass of xarray.Index
# https://docs.xarray.dev/en/latest/internals/how-to-create-custom-index.html
from xarray.core.indexes import PandasIndex, IndexSelResult, normalize_label, get_indexer_nd
# However, the built-in PandasIndex is _almost_ what we want,
# so we subclass it to steal most of the functionality
class LogIndexer(PandasIndex):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Except for .sel, where we can inject our custom logic!
def sel(self, labels: dict, method=None, tolerance=None) -> IndexSelResult:
# based on the code in https://github.com/pydata/xarray/blob/ce5130f39d780cdce87366ee657665f4a5d3051d/xarray/core/indexes.py#L745
if method == "nearest_log":
assert len(labels) == 1
coord_name, label = next(iter(labels.items()))
label_array = normalize_label(label, dtype=self.coord_dtype)
indexer = get_indexer_nd(np.log(self.index), np.log(label_array), method="nearest", tolerance=tolerance)
return IndexSelResult({self.dim: indexer})
return super().sel(labels, method, tolerance)
# Now we need to make a dataset that uses this custom index
ds_log = ds.drop_indexes('x').set_xindex('x', LogIndexer)
print(ds_log)
# x has values 10 and 100, to make it easy to observe the differences here:
print("normal:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest").x.data)
print("log:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest_log").x.data)
# prints:
# normal: [ 10 10 10 10 10 100]
# log: [ 10 10 10 100 100 100] |
I think we'd happily bundle that in Xarray, seems common enough. cc @benbovy |
To hew closer to my original suggestion, one could also accept a pointwise transform in the constructor, and then E.g., the This is more general (e.g. you could also do something like np.exp), but you lose the ability to use both |
Yes this would be nice to have this functionality built-in Xarray. Another option could be to have some This is even more general and could be useful also for indexes implementing alignment with some tolerance (e.g., #4489), although that might seem a little far-fetched.
It is still possible to duplicate the coordinate so each have an index working on a different scale, e.g., "foo" and "foo_log"). |
I'd be happy to open a PR on either approach if there's a clear favorite Edit: I may be misunderstanding, but it's not clear to me that a meta-index would work here, since the transformation also needs to be applied to the coordinate inside the index, not just the labels coming in to the sel call. |
Yes you are right, I didn't look carefully enough at your example and top comment. |
Given that:
|
It depends on whether we want to have this feature for free also in all indexes built on top of PandasIndex (most index types I've seen so far are either subclassing or encapsulating PandasIndex) vs. we want to keep PandasIndex as simple as possible (i.e., consistent with pandas and no extra option). From a broader perspective it seems that the flexible distance / scale discussed here might be useful for both selection and alignment so maybe it is preferable to address it outside of PandasIndex? Something like a I'm curious what others think, though! |
Is your feature request related to a problem? Please describe.
I've been using xarray with atmospheric data given on pressure levels. This data is best thought of in log(pressure) for computation, but it is stored and displayed as standard numbers. I would love if there was some way to have
data.sel(lev=4, method='nearest')
return the nearest data in log-space without having to have my index stored in the log space.E.g, currently
but in the scientific sense underlying this data, the closer point was actually 'b' at 10, not 'a' at 0.1.
In general, one can imagine situations where the opposite is true (storing data in log-space for numerical accuracy, but wanting a concept of 'nearest' which is the standard linear sense), or a desire for arbitrary scaling.
Describe the solution you'd like
The simplest solution I can imagine is to provide a preprocessor argument to the sel function which operates over numpy values and is used before the call to
get_loc
.e.g.
I believe this can be implemented by wrapping both index and label_value here with a call to the preprocess function (assuming the argument is only desired alongside the 'method' kwarg):
xarray/xarray/core/indexes.py
Lines 224 to 226 in 9daf9b1
Describe alternatives you've considered
I'm not sure how this would relate to the ongoing work on #1603, but one solution is to include a concept of the underlying number line within the index api. The end result is similar to the proposed implementation, but it would be stored with the index rather than passed to the
sel
method each time. This may keep thesel
api simpler if this feature was only available for a special ScaledIndex class or something like that.One version of this could also be used to set reasonable defaults when plotting, e.g. if a coordinate has a log numberline then it could set the x/yscale to 'log' by default when plotting over that coordinate.
The text was updated successfully, but these errors were encountered: