Multidimensional groupby #818
Merged (10 commits merged into pydata:master on Jul 8, 2016)
Conversation

@rabernat (Contributor) commented Apr 6, 2016

Many datasets have a two-dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays.

This PR overcomes that issue by using stack to collapse multiple dimensions in the group variable. A minimal example of the new functionality is

>>> da = xr.DataArray([[0,1],[2,3]],
                coords={'lon': (['ny','nx'], [[30,40],[40,50]]),
                        'lat': (['ny','nx'], [[10,10],[20,20]])},
                dims=['ny','nx'])
>>> da.groupby('lon').sum()
<xarray.DataArray (lon: 3)>
array([0, 3, 3])
Coordinates:
  * lon      (lon) int64 30 40 50
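
Under the hood, the idea is to use stack to collapse the grid dimensions into one, which makes the group variable one-dimensional so that the ordinary groupby machinery applies. A rough sketch of the equivalent manual steps (not the PR's literal code):

import xarray as xr

da = xr.DataArray([[0, 1], [2, 3]],
                  coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]]),
                          'lat': (['ny', 'nx'], [[10, 10], [20, 20]])},
                  dims=['ny', 'nx'])

# stacking (ny, nx) into a single 'gridcell' dimension makes 'lon' 1-D
stacked = da.stack(gridcell=['ny', 'nx'])
print(stacked.groupby('lon').sum())  # same values as da.groupby('lon').sum()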

This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc.

If you think this is worth pursuing, I would love some feedback.

The PR is not complete. Some items to address are

  • Create a specialized grouper to allow coarser bins. By default, if no grouper is specified, the GroupBy object uses all unique values to define the groups. With a high resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be da.groupby('lon', bins=range(-90,90)).
  • Allow specification of which dims to stack. For example, stack in space but keep time dimension intact. (Currently it just stacks all the dimensions of the group variable.)
  • A nice example for the docs.

@shoyer (Member) commented Apr 6, 2016

Yes, this is awesome! I had a vague idea that stack could make something like this possible but hadn't really thought it through.

As for the specialized "grouper", I agree that that makes sense. It's basically an extension of resample from dates to floating point -- noting that pandas recently changed the resample API so it works a little more like groupby. pandas.cut could probably handle most of the logic here.
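
A minimal sketch of how pandas.cut could supply the binning logic (illustrative only, with made-up latitude values):

import numpy as np
import pandas as pd

lat = np.array([10.3, 10.7, 45.2, 46.9])
bins = np.arange(-90, 91)     # 1-degree latitude bins
labels = pd.cut(lat, bins)    # one interval label per value
codes = labels.codes          # integer group ids; -1 means "fell in no bin"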

# pandas/hashtable.pyx in pandas.hashtable.Float64HashTable.get_labels (pandas/hashtable.c:10302)()
# pandas/hashtable.so in View.MemoryView.memoryview_cwrapper (pandas/hashtable.c:29882)()
# pandas/hashtable.so in View.MemoryView.memoryview.__cinit__ (pandas/hashtable.c:26251)()
# ValueError: buffer source array is read-only
Member (inline review comment):
yikes, this is messy. not much else we can do about it, though

Member (inline review comment):
Probably better to move this traceback to a followup issue and just reference it by number in the code. Otherwise it's pretty distracting.

rabernat (author, inline reply):
Are you suggesting I file a new issue with pandas? The minimal example seems to be

import numpy as np
import pandas as pd

a = np.arange(10)
a.flags.writeable = False
pd.factorize(a)  # ValueError: buffer source array is read-only
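
For reference, the workaround this PR ends up using (see below) is simply to factorize a writable copy of the group variable; continuing the snippet:

pd.factorize(np.array(a))  # copying the array makes the buffer writable again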

@shoyer (Member) commented Apr 6, 2016

This will need to unstack to handle .apply. That will be nice for things like normalization.

@rabernat (Contributor, Author) commented Apr 6, 2016

> This will need to unstack to handle .apply. That will be nice for things like normalization.

Can you clarify what you mean by this? At what point should the unstack happen?

With the current code, apply seems to work ok:

>>> da.groupby('lon').apply(lambda x: (x**2).sum())
<xarray.DataArray (lon: 3)>
array([0, 5, 9])
Coordinates:
  * lon      (lon) int64 30 40 50

But perhaps I am missing a certain use case you have in mind?

@rabernat (Contributor, Author) commented Apr 6, 2016

> As for the specialized "grouper", I agree that that makes sense. It's basically an extension of resample from dates to floating point -- noting that pandas recently changed the resample API so it works a little more like groupby. pandas.cut could probably handle most of the logic here.

I've normally used numpy.digitize for this kind of thing, but pandas.cut indeed seems like the obvious choice.
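
For comparison, a quick sketch of the two (note the different edge conventions: digitize is left-closed by default, while cut is right-closed and carries interval labels):

import numpy as np
import pandas as pd

x = np.array([0.2, 1.5, 2.5, 3.7])
edges = [0.0, 1.0, 2.0, 3.0, 4.0]

idx = np.digitize(x, edges)   # integer bin indices: array([1, 2, 3, 4])
cats = pd.cut(x, edges)       # intervals: (0, 1], (1, 2], (2, 3], (3, 4]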

Should this go into a separate PR?

@rabernat (Contributor, Author) commented Apr 6, 2016

Let me try to clarify what I mean in item 2:

> Allow specification of which dims to stack.

Say you have the following dataset

>>> ds = xr.Dataset(
        {'temperature': (['time','nx'], [[1,1,2,2],[2,2,3,3]]),
         'humidity': (['time','nx'], [[1,1,1,1],[1,1,1,1]])})

Now imagine you want to average humidity in temperature coordinates. (This might sound like a bizarre operation, but it is actually the foundation of a sophisticated sort of thermodynamic analysis.)

Currently this works as follows

>>> ds = ds.set_coords('temperature')
>>> ds.humidity.groupby('temperature').sum()
<xarray.DataArray 'humidity' (temperature: 3)>
array([2, 4, 2])
Coordinates:
  * temperature  (temperature) int64 1 2 3

However, this sums over all time. What if you wanted to preserve the time dependence, but replace the nx coordinate with temperature? I would like to be able to say

ds.humidity.groupby('temperature', group_over='nx').sum()

and get back a DataArray with dimensions ('time', 'temperature').

Maybe this is already possible with a sophisticated use of apply. But I don't see how to do it.
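
The closest workaround I can see is an explicit loop over time (a sketch using the dataset above; because the temperature values differ between time steps, alignment during the concat would introduce NaNs):

import xarray as xr

# apply a 1-D temperature groupby at each time step, then recombine
pieces = [ds.humidity.isel(time=t).groupby('temperature').sum()
          for t in range(ds.dims['time'])]
result = xr.concat(pieces, dim='time')  # dims: ('time', 'temperature')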

@shoyer (Member) commented Apr 6, 2016

(Oops, pressed the wrong button to close)

> Can you clarify what you mean by this? At what point should the unstack happen?

Consider ds.groupby('latitude').apply(lambda x: x - x.mean()) or ds.groupby('latitude') - ds.groupby('latitude').mean() (these are two ways of writing the same thing). In each of these cases, the result of a groupby has the same dimensions as the original instead of replacing one or more of the original dimensions.

@jhamman (Member) commented Apr 6, 2016

@rabernat - I don't have much to add right now, but I'm very excited about this addition. Once you've filled in a few more of the features, ping me and I'll give it a full review and test it out in some applications we have in house.

@rabernat (Contributor, Author) commented Apr 7, 2016

@shoyer I'm having a tough time figuring out where to put the unstacking logic... maybe you can give me some advice.

My first idea was to add a method to the GroupBy class called _maybe_unstack_array and make a call to it here. The problem with that approach is that the group iteration happens over Variables, not full DataArrays, which makes unstacking harder: we would need to store lots of metadata about the stacked / unstacked dimension names, sizes, etc.

If you think that is the right approach, I will forge ahead. But maybe, as the author of both the groupby and stack / unstack logic, you can see an easier way.

@shoyer (Member) commented Apr 7, 2016

@rabernat That looks like exactly the right place to me.

We only use variables for the concatenation in the shortcut=True path. With shortcut=False, we use DataArray/Dataset objects. For now, get it working with shortcut=False (hard code it if necessary) and I can help figure out how to extend it to shortcut=True.

+        self._unstacked_dims = orig_dims
+        # we also need to rename the group name to avoid a conflict when
+        # concatenating
+        group_name += '_groups'
rabernat (author, inline comment on the diff above):
This was something I had to add in order to handle the namespace conflict that happens when trying to concat the unstacked arrays in apply. They will inevitably have a coordinate with the same name as the groupby dimension (group.name), so I add a suffix to the group name.

I don't think this is necessarily a bad thing, since there is a conceptual difference between the multi-dimensional coordinate (e.g. lon) and the one-dimensional group (e.g. lon_groups).

@rabernat (Contributor, Author) commented Apr 7, 2016

My new commit supports unstacking in apply with shortcut=True. However, the behavior is kind of weird, in a way that is unique to the multidimensional case.

Consider the behavior of the test case:

>>> da = xr.DataArray([[0,1],[2,3]],
        coords={'lon': (['ny','nx'], [[30,40],[40,50]]),
                'lat': (['ny','nx'], [[10,10],[20,20]])},
        dims=['ny','nx'])
>>> da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
<xarray.DataArray (lon_groups: 3, ny: 2, nx: 2)>
array([[[ 0. ,  nan],
        [ nan,  nan]],

       [[ nan, -0.5],
        [ 0.5,  nan]],

       [[ nan,  nan],
        [ nan,  0. ]]])
Coordinates:
  * ny          (ny) int64 0 1
  * nx          (nx) int64 0 1
    lat         (lon_groups, ny, nx) float64 10.0 nan nan nan nan 10.0 20.0 ...
    lon         (lon_groups, ny, nx) float64 30.0 nan nan nan nan 40.0 40.0 ...
  * lon_groups  (lon_groups) int64 30 40 50

When unstacking, the indices that are not part of the group get filled with NaNs. We are not able to put these arrays back together into a single array.

Note that if we do not rename the group here:
https://github.com/pydata/xarray/pull/818/files#diff-96b65e0bfec9fd2b9d562483f53661f5R121

Then we get an error here:
https://github.com/pydata/xarray/pull/818/files#diff-96b65e0bfec9fd2b9d562483f53661f5R407

ValueError: the variable 'lon' has the same name as one of its dimensions ('lon', 'ny', 'nx'), but it is not 1-dimensional and thus it is not a valid index

-        applied = (maybe_wrap_array(arr, func(arr, **kwargs)) for arr in grouped)
+        applied = (self._maybe_unstack_array(
+            maybe_wrap_array(arr, func(arr, **kwargs)))
+            for arr in grouped)
         combined = self._concat(applied, shortcut=shortcut)
         result = self._maybe_restore_empty_groups(combined)
Member (inline review comment):
I think we want to unstack result once, not each array in applied

rabernat (author, inline reply):
Yes, that seems obvious now! (But only in retrospect. ;)

I just tried putting the call to _maybe_unstack_array around combined.

But the problem is that the multi-index is no longer present in combined. If I add a call to print(combined.indexes), I get

stacked_ny_nx: Index([(0, 0), (0, 1), (1, 0), (1, 1)], dtype='object', name=u'stacked_ny_nx')

i.e. just a regular index. This makes unstacking impossible (or a lot harder). I guess this is related to #769.
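
For what it's worth, the tuples can in principle be rebuilt into a proper MultiIndex (a quick sketch), though the groupby code would then need to carry the component names around:

import pandas as pd

idx = pd.Index([(0, 0), (0, 1), (1, 0), (1, 1)], name='stacked_ny_nx')
# rebuild the MultiIndex from the plain tuple index; with this attached,
# unstacking becomes possible again
midx = pd.MultiIndex.from_tuples(list(idx), names=['ny', 'nx'])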

Unfortunately I have to copy the group variable because of the pandas / cython bug referenced above.

So kind of stuck.

rabernat (author, inline comment):
Never mind... I think I see the way forward.

Member (inline review comment):
I think the issue here is actually that Variable.concat with a multiindex does not preserve the multiindex.

@shoyer (Member) commented Apr 7, 2016

I think that if we unstack things properly (only once, instead of on each applied element) we should get something like this, alleviating the need for the new group name:

<xarray.DataArray (ny: 2, nx: 2)>
array([[ 0. , -0.5],
       [ 0.5,  0. ]])
Coordinates:
  * ny       (ny) int64 0 1
  * nx       (nx) int64 0 1

@rabernat (Contributor, Author) commented Apr 7, 2016

I think I got it working.

@rabernat (Contributor, Author) commented Apr 7, 2016

The Travis build failure is a conda problem, not something caused by my commit.

@rabernat (Contributor, Author) commented Apr 8, 2016

@shoyer regarding the binning, should I modify resample to allow for non-time dimensions? Or a new function?

@shoyer (Member) commented Apr 8, 2016

@rabernat I'm not quite sure resample is the right place to put this, given that we aren't resampling on an axis. Just opened a pandas issue to discuss: pandas-dev/pandas#12828

@rabernat (Contributor, Author) commented Apr 8, 2016

I have tried adding a new bins keyword argument to groupby, which should accomplish what I want and more. (It will also work on regular one-dimensional groupby operations.)

The way it works is like this:

>>> ar = xr.DataArray(np.arange(4), dims='dim_0')
>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
>>> ar.groupby('dim_0', bins=[2,4]).sum()
<xarray.DataArray (dim_0: 2)>
array([1, 5])
Coordinates:
  * dim_0    (dim_0) int64 2 4

The only problem is that it seems to overwrite the original dimension of the array! After calling groupby:

>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 2 4

I think that resample overcomes this issue by renaming the dimension:
https://github.com/pydata/xarray/blob/master/xarray/core/common.py#L437

I guess something similar should be possible here...

@@ -338,6 +356,11 @@ def lookup_order(dimension):
         new_order = sorted(stacked.dims, key=lookup_order)
         return stacked.transpose(*new_order)

+    def _restore_multiindex(self, combined):
+        if self._stacked_dim is not None and self._stacked_dim in combined.dims:
+            combined[self._stacked_dim] = self.group[self._stacked_dim]
Member (inline review comment):
Is this where we somehow modify the original object being grouped? Maybe better to try using assign instead of mutation.
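
For instance, something along these lines (a sketch, assuming combined is a DataArray or Dataset so assign_coords is available; not the PR's actual code):

# mutation modifies combined (and possibly shared variables) in place:
#     combined[self._stacked_dim] = self.group[self._stacked_dim]
# an assign-based alternative returns a new object and leaves inputs alone:
combined = combined.assign_coords(
    **{self._stacked_dim: self.group[self._stacked_dim]})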

rabernat (author, inline reply):
Ok, I can try that.

rabernat (author, inline comment):
But to clarify, we are just restoring the dimension to what it was supposed to be before the multi-index got mangled by _concat.

Member (inline reply):
Yeah, this doesn't seem like the likely source -- I'm just going out on a limb here. Also, we might want to actually fix this in concat eventually instead (that would be the more general solution).

@rabernat (Contributor, Author) commented:

So I tracked down the cause of the original array dimensions being overwritten. It happens within _concat_shortcut here:
https://github.com/pydata/xarray/blob/master/xarray/core/groupby.py#L325

result._coords[concat_dim.name] = as_variable(concat_dim, copy=True)

At this point, self.obj gets modified directly.

@shoyer should I just focus on the case where shortcut==False? Or should I try to debug the _concat_shortcut method? Your inline comments ("don't worry too much about maintaining this method") suggest that it is not going to be around forever.

@clarkfitzg (Member) commented:
Ah, now I see what you were going for. More going on here than I realized. That's a nice plot :)

@rabernat (Contributor, Author) commented:
Just a little update--I realized that calling apply on multidimensional binned groups fails when the group is not reduced. For example

ds.groupby_bins('lat', lat_bins).apply(lambda x: x - x.mean())

raises errors because of conflicting coordinates when trying to concat the results. I only discovered this when making my tutorial notebook. I think I know how to fix it, but I haven't had time yet.

So it is moving along... I am excited about this feature and am confident it can make it into the next release.

@rabernat (Contributor, Author) commented Jun 5, 2016

@shoyer, @jhamman, could you give me some feedback on one outstanding issue with this PR? I am stuck on a kind of obscure edge case, but I really want to get this finished.

Consider the following groupby operation, which creates bins which are finer than the original coordinate. In other words, some bins are empty because there are too many bins.

import numpy as np
import xarray as xr

dat = xr.DataArray(np.arange(4))
dim_0_bins = np.arange(0, 4.5, 0.5)
gb = dat.groupby_bins('dim_0', dim_0_bins)
print(gb.groups)

gives

{'(0.5, 1]': [1], '(2.5, 3]': [3], '(1.5, 2]': [2]}
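
Running the same bin edges through pandas.cut makes the gaps explicit; note that 0 falls in no bin at all, because the intervals exclude their left edge:

import numpy as np
import pandas as pd

print(pd.cut(np.arange(4), np.arange(0, 4.5, 0.5)))
# [NaN, (0.5, 1.0], (1.5, 2.0], (2.5, 3.0]] plus the list of 8 categories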

If I try a reducing apply operation, e.g. gb.mean(), it works fine. However, if I do

gb.apply(lambda x: x - x.mean())

I get an error on the concat step

--> 433         combined = self._concat(applied, shortcut=shortcut)
... [long stack trace]
IndexError: index 3 is out of bounds for axis 1 with size 3

I'm really not sure what the "correct behavior" should even be in this case. It is not even possible to reconstitute the original data array by doing gb.apply(lambda x: x). The same problem arises when the groups do not span the entire coordinate (e.g. dim_0_bins = [1,2,3]).

Do you have any thoughts / suggestions? I'm not sure I can solve this issue right now, but I would at least like to have a more useful error message.

@shoyer (Member) commented Jun 6, 2016

I think I can fix this, by making concatenation work properly on index objects. Stay tuned...

@rabernat (Contributor, Author) commented Jun 6, 2016

@shoyer: I'm not sure this is as simple as a technical fix. It is a design question.

With regular groupby, the groups are guaranteed to span the original coordinates exactly, so you can always put the original dataarrays back together from the groupby object, i.e. ds.groupby('dim_0').apply(lambda x: x).

With groupby_bins, the user specifies the bins and might do so in such a way that

  • there are empty groups
  • there are indices which don't belong to any group

In both cases, it is not obvious to me what should happen when calling .apply(lambda x: x). Especially for the latter, I would probably want to raise an error informing the user that their bins are not sufficient to reconstitute the full index.
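
A cheap check for the second case might look something like this (a sketch, not the PR's actual code):

import numpy as np
import pandas as pd

values = np.arange(4)
labels = pd.cut(values, bins=[1, 2, 3])
unbinned = pd.isnull(labels)  # True where a value fell outside every bin
if unbinned.any():
    raise ValueError('bins do not cover values %s; cannot reconstitute '
                     'the original index' % values[unbinned])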

@shoyer (Member) commented Jun 6, 2016

Empty groups should be straightforward -- we should be able to handle them.

Indices which don't belong to any group are indeed more problematic. I think we have three options here:

  1. Raise an error when calling .groupby_bins(...)
  2. Raise an error when calling .groupby_bins(...).apply(...)
  3. Simply concatenate back together whatever items were grouped, and give up on the guarantee that applying the identity function restores the original item.

I think my preference would be for option 3, though 1 or 2 could be reasonable workarounds for now (raising NotImplementedError), because 3 is likely to be a little tricky to implement.

@shoyer (Member) commented Jun 8, 2016

I think #875 should fix the issue with concatenating index objects.

@rabernat (Contributor, Author) commented Jun 8, 2016

> I think #875 should fix the issue with concatenating index objects.

Should I try to merge your branch with my branch...or wait for your branch to get merged into master?

@shoyer (Member) commented Jun 8, 2016

Looks like I still have a bug (failing Travis builds). Let me see if I can get that sorted out first.

@rabernat force-pushed the groupby_multidim branch from f8ff81a to 237fc39 on July 6, 2016
@rabernat (Contributor, Author) commented Jul 6, 2016

I just rebased and updated this PR. I have not resolved all of the edge cases, such as what to do about non-reducing groupby_bins operations that don't span the entire coordinate. Unfortunately merging @shoyer's fix from #875 did not resolve this problem, at least not in a way that was obvious to me.

My feeling is that this PR in its current form introduces some very useful new features. For my part, I am eager to start using it for actual science projects. Multidimensional grouping is unfamiliar territory. I don't think I can resolve every potential issue in this PR right now--I don't have the necessary skills, nor can I anticipate every use case. I think that getting this merged and out in the wild will give us valuable user feedback that will help figure out where to go next. Plus, it would expose the code to developers with the skills to resolve some of the issues. By waiting much longer, we risk the PR going stale, since lots of other parts of xarray are also in flux.

Please let me know what you think.

@@ -343,6 +343,60 @@ def groupby(self, group, squeeze=True):
             group = self[group]
         return self.groupby_cls(self, group, squeeze=squeeze)

+    def groupby_bins(self, group, bins, right=True, labels=None, precision=3,
+            include_lowest=False, squeeze=True):
Member (inline review comment):
indent should match the opening parentheses

@shoyer (Member) commented Jul 6, 2016

@rabernat I agree. There are a couple of minor style/PEP 8 issues, and we need an entry for "what's new", but let's merge this. I can then play around a little bit with potential fixes.

@shoyer (Member) commented Jul 8, 2016

OK, merging.....

@shoyer merged commit a0a3860 into pydata:master on Jul 8, 2016