Multidimensional groupby #818
Conversation
(force-pushed: 28bc38f to 807001e)
Yes, this is awesome! I had a vague idea that […]. As for the specialized "grouper", I agree that makes sense. It's basically an extension of […].
# pandas/hashtable.pyx in pandas.hashtable.Float64HashTable.get_labels (pandas/hashtable.c:10302)()
# pandas/hashtable.so in View.MemoryView.memoryview_cwrapper (pandas/hashtable.c:29882)()
# pandas/hashtable.so in View.MemoryView.memoryview.__cinit__ (pandas/hashtable.c:26251)()
# ValueError: buffer source array is read-only
yikes, this is messy. not much else we can do about it, though
Probably better to move this traceback to a followup issue and just reference it by number in the code. Otherwise it's pretty distracting.
Are you suggesting I file a new issue with pandas? The minimal example seems to be:

import numpy as np
import pandas as pd

a = np.arange(10)
a.flags.writeable = False
pd.factorize(a)  # ValueError: buffer source array is read-only
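For what it's worth, a possible workaround sketch (and apparently what this PR ends up doing for the group variable -- see the later comment about copying): hand pandas a fresh, writable copy.

import numpy as np
import pandas as pd

a = np.arange(10)
a.flags.writeable = False
# pd.factorize(a) raises "ValueError: buffer source array is read-only";
# copying first restores a writable buffer:
labels, uniques = pd.factorize(np.array(a))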
This will need to unstack to handle .apply. That will be nice for things like normalization.
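For instance, per-group normalization might look like this (a hypothetical usage sketch, borrowing the example array used elsewhere in this thread):

import xarray as xr

da = xr.DataArray([[0, 1], [2, 3]], dims=['ny', 'nx'],
                  coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]])})
# remove each lon group's mean; the result must then be unstacked back
# onto the original (ny, nx) grid, which is what this comment is about
anomalies = da.groupby('lon').apply(lambda x: x - x.mean())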
Can you clarify what you mean by this? At what point should the unstack happen? With the current code, apply seems to work ok:

>>> da.groupby('lon').apply(lambda x: (x**2).sum())
<xarray.DataArray (lon: 3)>
array([0, 5, 9])
Coordinates:
  * lon      (lon) int64 30 40 50

But perhaps I am missing a certain use case you have in mind?
I normally used […]. Should this go into a separate PR?
Let me try to clarify what I mean in item 2.

Say you have the following dataset:

>>> ds = xr.Dataset(
...     {'temperature': (['time', 'nx'], [[1, 1, 2, 2], [2, 2, 3, 3]]),
...      'humidity': (['time', 'nx'], [[1, 1, 1, 1], [1, 1, 1, 1]])})

Now imagine you want to average humidity in temperature coordinates. (This might sound like a bizarre operation, but it is actually the foundation of a sophisticated sort of thermodynamic analysis.) Currently this works as follows:

>>> ds = ds.set_coords('temperature')
>>> ds.humidity.groupby('temperature').sum()
<xarray.DataArray 'humidity' (temperature: 3)>
array([2, 4, 2])
Coordinates:
  * temperature  (temperature) int64 1 2 3

However, this sums over all time. What if you wanted to preserve the time dependence, but replace the nx dimension with temperature? The idea would be to write something like

ds.humidity.groupby('temperature', group_over='nx').sum()

and get back a DataArray with dimensions ('time', 'temperature'). Maybe this is already possible with a sophisticated use of apply?
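In the meantime, one might approximate group_over by hand -- a sketch under the assumption that xr.concat outer-joins the differing temperature indexes, filling missing groups with NaN (group_over itself is only proposed here, not implemented):

import xarray as xr

ds = xr.Dataset(
    {'temperature': (['time', 'nx'], [[1, 1, 2, 2], [2, 2, 3, 3]]),
     'humidity': (['time', 'nx'], [[1, 1, 1, 1], [1, 1, 1, 1]])})
ds = ds.set_coords('temperature')

# group each time slice separately, then stack the results along time;
# this keeps 'time' while replacing 'nx' with 'temperature'
slices = [ds.humidity.isel(time=t).groupby('temperature').sum()
          for t in range(ds.dims['time'])]
out = xr.concat(slices, dim='time')  # dims: ('time', 'temperature')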
(Oops, pressed the wrong button to close)
Consider […].
@rabernat - I don't have much to add right now but I'm very excited about this addition. Once you've filled in a few more of the features, ping me and I'll give it a full review and will test it out in some applications we have in house.
@shoyer I'm having a tough time figuring out where to put the unstacking logic... maybe you can give me some advice. My first idea was to add a method to the GroupBy class called […]. If you think that is the right approach, I will forge ahead. But maybe, as the author of both the groupby and stack / unstack logic, you can see an easier way.
@rabernat That looks like exactly the right place to me. We only use variables for the concatenation in the […].
self._unstacked_dims = orig_dims
# we also need to rename the group name to avoid a conflict when
# concatenating
group_name += '_groups'
This was something I had to add in order to handle the namespace conflict that happens when trying to concat the unstacked arrays in apply. They will inevitably have a coordinate with the same name as the groupby dimension (group.name), so I add a suffix to the group name.
I don't think this is necessarily a bad thing, since there is a conceptual difference between the multi-dimensional coordinate (e.g. lon) and the one-dimensional group (e.g. lon_groups).
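A minimal sketch of the conflict being avoided here (my own reconstruction, not code from this PR):

import numpy as np
import xarray as xr

# each unstacked per-group result keeps the original 2-D 'lon' coordinate
pieces = [xr.DataArray(np.zeros((2, 2)), dims=['ny', 'nx'],
                       coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]])})
          for _ in range(3)]

# concatenating along a new dim also called 'lon' would make 'lon' both a
# dimension name and a 2-D coordinate, which xarray rejects:
# xr.concat(pieces, dim='lon')  # raises
# renaming the group dimension sidesteps the collision:
combined = xr.concat(pieces, dim='lon_groups')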
My new commit supports unstacking in apply with shortcut=False. Consider the behavior of the test case:

>>> da = xr.DataArray([[0, 1], [2, 3]],
...                   coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]]),
...                           'lat': (['ny', 'nx'], [[10, 10], [20, 20]])},
...                   dims=['ny', 'nx'])
>>> da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
<xarray.DataArray (lon_groups: 3, ny: 2, nx: 2)>
array([[[ 0. ,  nan],
        [ nan,  nan]],

       [[ nan, -0.5],
        [ 0.5,  nan]],

       [[ nan,  nan],
        [ nan,  0. ]]])
Coordinates:
  * ny          (ny) int64 0 1
  * nx          (nx) int64 0 1
    lat         (lon_groups, ny, nx) float64 10.0 nan nan nan nan 10.0 20.0 ...
    lon         (lon_groups, ny, nx) float64 30.0 nan nan nan nan 40.0 40.0 ...
  * lon_groups  (lon_groups) int64 30 40 50

When unstacking, the indices that are not part of the group get filled with nans. We are not able to put these arrays back together into a single array.

Note that if we do not rename the group name here […], then we get an error here […].
-        applied = (maybe_wrap_array(arr, func(arr, **kwargs)) for arr in grouped)
+        applied = (self._maybe_unstack_array(
+            maybe_wrap_array(arr, func(arr, **kwargs)))
+            for arr in grouped)
         combined = self._concat(applied, shortcut=shortcut)
         result = self._maybe_restore_empty_groups(combined)
I think we want to unstack result once, not each array in applied.
Yes, that seems obvious now! (But only in retrospect. ;)

I just tried putting the call to _maybe_unstack_arrays around combined. But the problem is that the multi-index is no longer present in combined. If I add a call to print(combined.indexes), I get

stacked_ny_nx: Index([(0, 0), (0, 1), (1, 0), (1, 1)], dtype='object', name=u'stacked_ny_nx')

i.e. just a regular index. This makes unstacking impossible (or a lot harder). I guess this is related to #769.

Unfortunately I have to copy the group variable because of the pandas / cython bug referenced above.

So I'm kind of stuck.
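To illustrate with a standalone pandas sketch (not code from the PR): once the MultiIndex degrades to a plain Index of tuples, unstack has no level information to work with, though the levels can in principle be rebuilt if the constituent dimension names are known.

import pandas as pd

# what the stacked dimension's index should look like:
mi = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['ny', 'nx'])
# what actually comes back from _concat: a plain Index of tuples, no levels
flat = pd.Index(list(mi), name='stacked_ny_nx')
# rebuilding requires knowing the original dimension names:
restored = pd.MultiIndex.from_tuples(flat, names=['ny', 'nx'])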
Never mind...I think I see the way forward.
I think the issue here is actually that Variable.concat with a multiindex does not preserve the multiindex. I think that if we unstack things properly (only once, instead of on each applied example) we should get something like this, alleviating the need for the new group name: […]
(force-pushed: 5b64e6f to b0db8a8)
I think I got it working.
The Travis build failure is a conda problem, not my commit.
@shoyer regarding the binning, should I modify resample?
@rabernat I'm not quite sure resample is the right place to put this, given that we aren't resampling on an axis. Just opened a pandas issue to discuss: pandas-dev/pandas#12828
I have tried adding a new keyword, bins. The way it works is like this:

>>> ar = xr.DataArray(np.arange(4), dims='dim_0')
>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
>>> ar.groupby('dim_0', bins=[2, 4]).sum()
<xarray.DataArray (dim_0: 2)>
array([1, 5])
Coordinates:
  * dim_0    (dim_0) int64 2 4

The only problem is that it seems to overwrite the original dimension of the array! After calling groupby:

>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 2 4

I think that […]. I guess something similar should be possible here...
@@ -338,6 +356,11 @@ def lookup_order(dimension):
         new_order = sorted(stacked.dims, key=lookup_order)
         return stacked.transpose(*new_order)

+    def _restore_multiindex(self, combined):
+        if self._stacked_dim is not None and self._stacked_dim in combined.dims:
+            combined[self._stacked_dim] = self.group[self._stacked_dim]
[…] is where we somehow modify the original object being grouped? Maybe better to try using assign instead of mutation.
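Something like this, perhaps (my sketch of the non-mutating alternative; assign_coords returns a new object instead of writing into combined in place):

# mutation: writes into combined (and possibly shared variables) in place
combined[self._stacked_dim] = self.group[self._stacked_dim]

# non-mutating alternative: build a new object with the coordinate replaced
combined = combined.assign_coords(
    **{self._stacked_dim: self.group[self._stacked_dim]})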
Ok, I can try that.
But to clarify, we are just restoring the dimension to what it was supposed to be before the multiindex index got mangled by _concat.
Yeah, this doesn't seem like the likely source -- I'm just going out on a limb here. Also, we might want to actually fix this in concat eventually instead (that would be the more general solution).
So I tracked down the cause of the original array dimensions being overwritten. It happens within […]:

result._coords[concat_dim.name] = as_variable(concat_dim, copy=True)

At this point, […]. @shoyer, should I just focus on the case where […]?
Ah, now I see what you were going for. More going on here than I realized. That's a nice plot :)
Just a little update: I realized that calling apply on multidimensional binned groups fails when the group is not reduced. For example,

ds.groupby_bins('lat', lat_bins).apply(lambda x: x - x.mean())

raises errors because of conflicting coordinates when trying to concat the results. I only discovered this when making my tutorial notebook. I think I know how to fix it, but I haven't had time yet. So it is moving along... I am excited about this feature and am confident it can make it into the next release.
@shoyer, @jhamman, could you give me some feedback on one outstanding issue with this PR? I am stuck on a kind of obscure edge case, but I really want to get this finished.

Consider the following groupby operation, which creates bins that are finer than the original coordinate. In other words, some bins are empty because there are too many bins.

dat = xr.DataArray(np.arange(4))
dim_0_bins = np.arange(0, 4.5, 0.5)
gb = dat.groupby_bins('dim_0', dim_0_bins)
print(gb.groups)

gives

[…]

If I try a non-reducing apply operation, e.g.

gb.apply(lambda x: x - x.mean())

I get an error on the concat step:

[…]

I'm really not sure what the "correct behavior" should even be in this case. It is not even possible to reconstitute the original data array by doing […].

Do you have any thoughts / suggestions? I'm not sure I can solve this issue right now, but I would at least like to have a more useful error message.
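The empty-bin situation can be reproduced with plain pandas (a standalone sketch, assuming pandas.cut-style semantics, which the right / labels / precision / include_lowest keywords in this PR suggest):

import numpy as np
import pandas as pd

dat = np.arange(4)
bins = np.arange(0, 4.5, 0.5)  # 8 bins for only 4 values
labels = pd.cut(dat, bins)
# every other bin is empty, and the value 0 gets a NaN label because the
# intervals are open on the left: (0.0, 0.5], (0.5, 1.0], ...
print(pd.Series(labels).value_counts(sort=False))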
I think I can fix this by making concatenation work properly on index objects. Stay tuned...
@shoyer: I'm not sure this is as simple as a technical fix. It is a design question. With regular groupby, every element of the array belongs to exactly one group. With groupby_bins, some bins can be empty, and some elements may fall outside all of the bins.

In both cases, it is not obvious to me what should happen when calling apply.
Empty groups should be straightforward -- we should be able to handle them. Indices which don't belong to any group are indeed more problematic. I think we have three options here: […]

I think my preference would be for option 3, though 1 or 2 could be reasonable workarounds for now (raising […]).
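For reference, one precedent worth noting (a standalone pandas fact, offered for comparison rather than as a recommendation): pandas groupby silently drops elements whose group key is NaN.

import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3])
keys = pd.Series([np.nan, 1.0, 1.0, 2.0])  # the first element has no group
print(s.groupby(keys).sum())
# 1.0    3
# 2.0    3
# the element with the NaN key (the value 0) simply disappears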
I think #875 should fix the issue with concatenating index objects.
Should I try to merge your branch with my branch... or wait for your branch to get merged into master?
Looks like I still have a bug (failing Travis builds). Let me see if I can […]
(force-pushed: f8ff81a to 237fc39)
I just rebased and updated this PR. I have not resolved all of the edge cases, such as what to do about non-reducing groupby_bins operations that don't span the entire coordinate. Unfortunately merging @shoyer's fix from #875 did not resolve this problem, at least not in a way that was obvious to me.

My feeling is that this PR in its current form introduces some very useful new features. For my part, I am eager to start using it for actual science projects. Multidimensional grouping is unfamiliar territory. I don't think every potential issue can be resolved by me right now via this PR--I don't have the necessary skills, nor can I anticipate every use case. I think that getting this merged and out in the wild will give us some valuable user feedback which will help figure out where to go next. Plus it would get exposed to developers with the skills to resolve some of the issues. By waiting much longer, we risk it going stale, since lots of other xarray elements are also in flux.

Please let me know what you think.
@@ -343,6 +343,60 @@ def groupby(self, group, squeeze=True):
         group = self[group]
         return self.groupby_cls(self, group, squeeze=squeeze)

+    def groupby_bins(self, group, bins, right=True, labels=None, precision=3,
+                     include_lowest=False, squeeze=True):
indent should match the opening parentheses
@rabernat I agree. I have a couple of minor style/pep8 issues, and we need an entry for "what's new", but let's merge this. I can then play around a little bit with potential fixes.
OK, merging.....
Many datasets have a two dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays.

This PR overcomes that issue by using stack to collapse multiple dimensions in the group variable. A minimal example of the new functionality is […] (see the reconstruction below).

This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc.

If you think this is worth pursuing, I would love some feedback.

The PR is not complete. Some items to address are:

- […]
- If no grouper is specified, the GroupBy object uses all unique values to define the groups. With a high resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be da.groupby('lon', bins=range(-90, 90)).
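A plausible reconstruction of the minimal example mentioned above, based on the test case shown earlier in the thread:

import xarray as xr

da = xr.DataArray([[0, 1], [2, 3]],
                  coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]]),
                          'lat': (['ny', 'nx'], [[10, 10], [20, 20]])},
                  dims=['ny', 'nx'])

# group over the values of the two-dimensional coordinate 'lon'
da.groupby('lon').apply(lambda x: (x**2).sum())
# <xarray.DataArray (lon: 3)>
# array([0, 5, 9])
# Coordinates:
#   * lon      (lon) int64 30 40 50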