WIP: Optional indexes (no more default coordinates given by range(n)) #1017
Conversation
I have no use case for this functionality, so I have no strong opinion about it. Two questions though:
Basically, this is about the user experience and first impressions. There are very few cases when somebody would prefer a default index to no index at all, so I see few reasons to keep the current behavior. From the experience of new users, it's really nice to be able to incrementally try out features from a new library. Seeing extra information appear in the data model that they didn't add makes people (rightfully) nervous, because they don't know how it will work yet.
Labeled dimensions without coordinate labels actually get you plenty: you get better broadcasting and aggregation over named dimensions. But the big advantage is the ability to cleanly mix dimensions with and without indexes on the same objects, which covers several more use cases for labeled arrays. Examples off hand include images (see the example from my first post) and machine learning models (where columns usually have labels corresponding to features but rows often are simply unlabeled examples).
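For instance, a rough sketch of the kind of usage this enables once the PR is in (the dimension and feature names here are made up for illustration):

```python
import numpy as np
import xarray as xr

# Columns are named features with labels; rows are just unlabeled examples.
X = xr.DataArray(
    np.random.rand(100, 3),
    dims=("sample", "feature"),
    coords={"feature": ["age", "height", "weight"]},
)

# Label-based operations work on the indexed dimension...
heights = X.sel(feature="height")
# ...while aggregations over the unlabeled dimension need no fake index.
feature_means = X.mean(dim="sample")
```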
FWIW, I solved this issue in a slightly different way in my own labelled array project (https://github.com/liam2/larray) (which I still hope to one day merge with xarray -- I will probably need to rewrite my project on top of xarray because the ship has sailed concerning the user-facing API): by default, you get "wildcard" axes, which only have a size and no labels (they do get range() labels on demand, so you can .sel on that dimension -- to speak in xarray vocabulary). Those wildcard labels are not as picky as normal labels: a wildcard axis compares equal/aligns with other axes as long as it has the same length. In practice, I guess it will be very similar to not having an index at all (and it is probably cleaner this way, but I didn't think of that at the time). All of this to say that yes, this PR is definitely a good idea and would make xarray useful in more situations, as I have hit a lot of cases where real range() labels like you have now made things a lot more painful than necessary. The only advantage I can think of now (apart from it being easier for me to implement this way) of having a "wildcard axis" instead of no index/labels at all is that a subset could keep the information about which "tick" it comes from (again without blocking alignment). Not sure it's worth it though (I have actually not implemented it this way yet, but I was contemplating doing so).
@gdementen Thanks for chiming in! Yes, in practice I think "no index" for xarray will work almost exactly the same as your "wildcard index".
I'm not a fan of this one. It's a perpetual minor annoyance with pandas to subset a meaningless default index and end up with labels that look meaningful but aren't.
@shoyer Honestly, I have not thought that part (keeping the tick labels for subsets) through (since I have not encountered an actual use case for it yet); I was more or less dumping my thought process in case you can make something useful out of it :). Nevertheless, those labels would not necessarily be nonsensical. In your image example above, it seems to me that knowing that the region of the image you are using comes (for example) from the center-right of the original image could be useful information. As for conveying that the dimension is special, I use a "*" next to the axis name to convey that it is a wildcard axis. It seems to go well with my current users.
How does one select / slice a DataArray with no index?
For .sel, my current inclination is to pass the indexers directly on to .isel, without remapping labels -- i.e., indexing along a dimension without an associated coordinate is positional.
That sounds like the right way to go.
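Concretely, a sketch of what that decision implies (per the design decision recorded in the PR description; the behavior of much newer xarray versions may differ):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12).reshape(3, 4), dims=("x", "y"))

# 'x' has no coordinate, so .sel indexers are passed through to .isel,
# i.e. selection along 'x' is positional.
assert da.sel(x=1).equals(da.isel(x=1))
```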
I can think of some possible use cases where this behavior -- if I understand it well -- may not be desired. For example, if we want to compute partial derivatives by finite differences, using xarray would not give the expected result (as numpy does):

>>> import numpy as np
>>> import xarray as xr
>>> z = np.random.rand(5, 5)
>>> dz_dx_numpy = z[1:, :] - z[:-1, :]  # forward difference
>>> dz_dx_numpy
array([[-0.16122906, -0.73841927,  0.11565084,  0.94529388,  0.04245993],
       [ 0.21066572,  0.11964058, -0.11203992, -0.52073269, -0.50827324],
       [-0.42100012,  0.39873985,  0.07957889, -0.02071456,  0.59944732],
       [-0.53819024, -0.29738881,  0.35238292,  0.01903809,  0.15671588]])
>>> da = xr.DataArray(z, dims=('x', 'y'), name='z')
>>> dz_dx_xarray = da[1:, :] - da[:-1, :]  # forward difference
>>> dz_dx_xarray
<xarray.DataArray 'z' (x: 3, y: 5)>
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
Coordinates:
  * x        (x) int64 1 2 3

However, I guess that this specific kind of problem should rather be addressed using the upcoming logic for applying vectorized functions to xarray objects (#964).
That sounds a bit weird to me (I'm not sure I understand, actually). What are the reasons/benefits of returning a …?
@benbovy This is actually a good use case for no dimension labels. E.g., from my working branch:
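Presumably something along these lines (a sketch reusing the z array from the previous comment, not the literal output of the working branch): with no default index on x, the two slices combine positionally, so the subtraction reproduces the NumPy forward difference instead of zeros.

```python
import numpy as np
import xarray as xr

z = np.random.rand(5, 5)
da = xr.DataArray(z, dims=("x", "y"), name="z")

dz_dx = da[1:, :] - da[:-1, :]  # forward difference; no alignment, no zeros
assert np.allclose(dz_dx.values, z[1:, :] - z[:-1, :])
assert dz_dx.sizes["x"] == 4
```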
This does depend on the details of how …
The (theoretical) benefit would be an easier transition, because previously every dimension had a default range(n) coordinate anyway. As I'm working through porting xarray's test suite, I'm realizing that this may not be the best approach. If a user is relying on those default coordinates, … Either way, the functionality in your …
Oops, I missed …
New design question: what should the new behavior for reindex_like be, if the argument has dimensions of different sizes but no dimension labels? Should we raise an error, or simply ignore these dimensions?
I have never encountered this case yet, but ignoring that dimension seems like a bad idea to me. When I use a.reindex_like(b), I usually mean "make a compatible with b", so I assume the resulting index and shape are the same as (or at least compatible with) b's. More importantly, I expect to be able to do binary operations between the re-indexed a and b; ignoring the index like you propose would break that. Given that, I would go with either an error, or treating the missing index as if it were range() in this case, i.e. filling the extra values with NaN. I think I would have a slight preference for the latter in my own code (wildcard axes), but in xarray I am unsure. The error might be more coherent.
@gdementen thanks for the input. I am inclined to agree. Even within the xarray codebase, we basically use reindex_like as a shortcut for align.
I think we want the error here, given that this is one of the major motivations for allowing missing coordinate labels (not assuming invalid labels).
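A sketch of the agreed behavior (the exact exception type and message are assumptions here):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros(4), dims="x")
b = xr.DataArray(np.zeros(3), dims="x")

# Neither object has labels along 'x' and the sizes differ, so rather than
# silently assuming range(n) labels, alignment (and reindex_like) should fail.
try:
    a.reindex_like(b)
except ValueError as err:
    print("cannot reindex:", err)
```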
Yes. But in that case you need a way to do the "fill with NaN" option without having to jump through too many hoops. How would you do that?
+1 for the align error.
Using …
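For what it's worth, one explicit way to get the fill-with-NaN behavior if you do want it (a sketch, not necessarily what was being suggested here):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3.0), dims="x")
b = xr.DataArray(np.zeros(5), dims="x")

# Opt in to range(n) labels explicitly, then reindex to b's size; the
# extra positions are filled with NaN.
a_labeled = a.assign_coords(x=np.arange(a.sizes["x"]))
filled = a_labeled.reindex(x=np.arange(b.sizes["x"]))
print(filled.values)  # [0., 1., 2., nan, nan]
```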
(force-pushed from 6a0a1f4 to fefd741)
This is ready for review. (The failing test is unrelated -- a regression in dask.array.)
(force-pushed from a1a2dbf to eb70506)
I've gone through the docs and updated every mention I could find of default indexes. So, I think this really is nearly ready to merge now -- review would be highly appreciated. I've added renderings of a few choice sections of the docs to the top post. The last remaining design decision is how to handle the transition. I would really prefer to avoid a deprecation cycle that involves issuing warnings in new users' first encounter with xarray. This means that dependent libraries will need to be updated if this change breaks them (which I think is likely). My best idea is to issue a "beta" release, write a mailing list post and give people lots of time to test and update their packages (a few weeks to a month).
I've gone through it and it works great. Two comments:
1. e.g. change this:
   …
   to:
   …
2. I had to change it to:
   …
   which is annoyingly ugly.
I also understand (1), but renaming … Another suggestion may be to "highlight" in the top list the dimensions which have an index, using the same symbol as in the coordinate list:
…
@crusaderky thanks for giving this a try!
RE: missing dimension in the repr -- I like the idea of mirroring … I'm less sure about adding markers for empty dimensions to coordinates. That makes for a much longer repr for some simple examples, e.g., … vs …
RE: drop -- I understand the annoyance of … What we could do is add a separate …
@shoyer: …
I tried adding …
In Python, sets have a … Overall no strong view; I'm probably a weak vote against adding …
Another option: finally add a boolean drop argument (see GH242).
Add drop=True argument to isel, sel and squeeze -- Fixes GH242. This is useful for getting rid of extraneous scalar variables that arise from indexing, and in particular will resolve an issue for optional indexes: pydata#1017 (comment)
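A quick sketch of what the new argument does (the values here are only illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},
)

# Plain isel leaves a scalar coordinate x=10 behind; drop=True removes it,
# which is handy now that coordinates are optional in the first place.
print(list(da.isel(x=0).coords))             # ['x']
print(list(da.isel(x=0, drop=True).coords))  # []
```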
Done -- missing coordinates are now marked by … in the repr. Are there any other concerns before I merge this?
The only concern I have with the repr changes is using the symbol … Maybe use another symbol like o that means "no index"? And/or even remove the coordinate name (because there is no coordinate).
This seems like the best alternative to me. I don't like omitting the variable name, because it looks like it might fall under the previous row, like a level in the MultiIndex repr.
Done. Any further concerns? I'd really like to merge this and then get the 0.9 release out shortly after. |
Go on :)
OK, in it goes!
* Add drop=True argument to isel, sel and squeeze -- Fixes GH242. This is useful for getting rid of extraneous scalar variables that arise from indexing, and in particular will resolve an issue for optional indexes: #1017 (comment)
* More tests for Dataset.squeeze(drop=True)
* Add two more tests, for drop=True without coords
@fmaussion has raised some concerns about the new repr in #1199 |
Fixes #283
Motivation
Currently, when a Dataset or DataArray is created without explicit coordinate labels for a dimension, we insert a coordinate with the values given by range(n).
This is problematic, for two main reasons:
As is, xarray isn't a good fit for users who don't have meaningful coordinate labels over one or more dimensions. So making labels optional would also increase the audience for the project.
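In other words, a small sketch contrasting the old and the proposed behavior:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"))

# Old behavior: 'x' and 'y' coordinates equal to range(n) were inserted
# automatically.  Proposed behavior: no coordinates unless the user adds them.
print(list(da.coords))  # [] with optional indexes; ['x', 'y'] before this PR
```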
Design decisions
In general, I have followed the alignment rules I suggested for pandas in wesm/pandas2#17, but there are still some xarray-specific design decisions to resolve:
* How to handle stack(z=['x', 'y']) when one or more of the original dimensions do not have labels? Decision: insert dummy dimensions with stack() as necessary. If we don't insert dummy indexes for MultiIndex levels, then we can't unstack properly anymore.
* How to handle missing indexes in .sel, e.g., array.sel(x=0) when x is not in array.coords? In the current version of this PR, this errors, but my current inclination is to pass .sel indexers directly on to .isel, without remapping labels. This has the bonus of preserving the current behavior for indexing. Decision: if a dimension does not have an associated coordinate, indexing along that dimension with .sel or .loc is positional (like .isel).
* Should we create dummy/virtual coordinates like range(n) on demand when indexing a dimension without labels? e.g., array.coords['x'] would return a DataArray with values range(n) (importantly, this would not change the original array). Decision: yes, this is the maximally backwards compatible thing to do.
* What should the new behavior for reindex_like be, if the argument has dimensions of different sizes but no dimension labels? Should we raise an error, or simply ignore these dimensions? Decision: users expect reindex_like to work like align. We will raise an error if dimension sizes do not match.
* What does the transition to this behavior look like? Do we simply add it to v0.9.0 with a big warning about backwards compatibility, or do we need to go through some sort of deprecation cycle with the current behavior? Decision: not doing a deprecation cycle, that would be too cumbersome.
Examples of new behavior
New doc sections