Multidimensional groupby #818

Merged
merged 10 commits into from
Jul 8, 2016
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -109,6 +109,7 @@ Computation
Dataset.apply
Dataset.reduce
Dataset.groupby
Dataset.groupby_bins
Dataset.resample
Dataset.diff

@@ -245,6 +246,7 @@ Computation

DataArray.reduce
DataArray.groupby
DataArray.groupby_bins
DataArray.rolling
DataArray.resample
DataArray.get_axis_num
1 change: 1 addition & 0 deletions doc/examples.rst
@@ -7,3 +7,4 @@ Examples
examples/quick-overview
examples/weather-data
examples/monthly-means
examples/multidimensional-coords
201 changes: 201 additions & 0 deletions doc/examples/multidimensional-coords.rst
@@ -0,0 +1,201 @@
.. _examples.multidim:

Working with Multidimensional Coordinates
=========================================

Author: `Ryan Abernathey <http://github.com/rabernat>`__

Many datasets have *physical coordinates* which differ from their
*logical coordinates*. Xarray provides several ways to plot and analyze
such datasets.

.. code:: python

    %matplotlib inline
    import numpy as np
    import pandas as pd
    import xarray as xr
    import cartopy.crs as ccrs
    from matplotlib import pyplot as plt

    print("numpy version : ", np.__version__)
    print("pandas version : ", pd.__version__)
    print("xarray version : ", xr.__version__)


.. parsed-literal::

    ('numpy version : ', '1.11.0')
    ('pandas version : ', u'0.18.0')
    ('xarray version : ', '0.7.2-32-gf957eb8')


As an example, consider this dataset from the
`xarray-data <https://github.com/pydata/xarray-data>`__ repository.

.. code:: python

    ! curl -L -O https://github.com/pydata/xarray-data/raw/master/RASM_example_data.nc

.. code:: python

    ds = xr.open_dataset('RASM_example_data.nc')
    ds




.. parsed-literal::

    <xarray.Dataset>
    Dimensions:  (time: 36, x: 275, y: 205)
    Coordinates:
      * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
        yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
        xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
      * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
      * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
    Data variables:
        Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
    Attributes:
        title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
        institution: U.W.
        source: RACM R1002RBRxaaa01a
        output_frequency: daily
        output_mode: averaged
        convention: CF-1.4
        references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
        comment: Output from the Variable Infiltration Capacity (VIC) model.
        nco_openmp_thread_number: 1
        NCO: 4.3.7
        history: history deleted for brevity



In this example, the *logical coordinates* are ``x`` and ``y``, while
the *physical coordinates* are ``xc`` and ``yc``, which represent the
longitudes and latitudes of the data, respectively.
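
The same layout can be reproduced without downloading any data. The
following is a minimal sketch on a synthetic grid; the array name ``toy``
and all coordinate values are hypothetical and only mimic the structure of
the RASM dataset (logical dimensions ``y`` and ``x``, two-dimensional
physical coordinates ``yc`` and ``xc``).

```python
import numpy as np
import xarray as xr

# Hypothetical 3x4 grid: logical dims (y, x), physical coords (yc, xc).
ny, nx = 3, 4
lon, lat = np.meshgrid(np.linspace(190.0, 193.0, nx), np.linspace(16.0, 18.0, ny))

toy = xr.DataArray(
    np.arange(ny * nx, dtype=float).reshape(ny, nx),
    dims=("y", "x"),
    coords={
        "xc": (("y", "x"), lon),  # longitude of each grid cell (made-up values)
        "yc": (("y", "x"), lat),  # latitude of each grid cell (made-up values)
    },
    name="toy",
)

# The logical dims index the array; the physical coords are 2D arrays
# that share those same dims.
print(toy.dims)
print(toy.xc.dims, toy.yc.shape)
```

Because ``xc`` and ``yc`` are defined on ``(y, x)``, they cannot serve as
index coordinates, which is why plotting and grouping need the special
handling described in the rest of this example.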

.. code:: python

    print(ds.xc.attrs)
    print(ds.yc.attrs)


.. parsed-literal::

    OrderedDict([(u'long_name', u'longitude of grid cell center'), (u'units', u'degrees_east'), (u'bounds', u'xv')])
    OrderedDict([(u'long_name', u'latitude of grid cell center'), (u'units', u'degrees_north'), (u'bounds', u'yv')])


Plotting
--------

Let's examine these coordinate variables by plotting them.

.. code:: python

    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14,4))
    ds.xc.plot(ax=ax1)
    ds.yc.plot(ax=ax2)




.. parsed-literal::

    <matplotlib.collections.QuadMesh at 0x118688fd0>



.. parsed-literal::

    /Users/rpa/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
      if self._edgecolors == str('face'):



.. image:: multidimensional_coords_files/xarray_multidimensional_coords_8_2.png


Note that the variables ``xc`` (longitude) and ``yc`` (latitude) are
two-dimensional scalar fields.

If we try to plot the data variable ``Tair``, by default we get the
logical coordinates.

.. code:: python

    ds.Tair[0].plot()




.. parsed-literal::

    <matplotlib.collections.QuadMesh at 0x11b6da890>




.. image:: multidimensional_coords_files/xarray_multidimensional_coords_10_1.png


In order to visualize the data on a conventional latitude-longitude
grid, we can take advantage of xarray's ability to apply
`cartopy <http://scitools.org.uk/cartopy/index.html>`__ map projections.

.. code:: python

    plt.figure(figsize=(14,6))
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.set_global()
    ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x='xc', y='yc', add_colorbar=False)
    ax.coastlines()
    ax.set_ylim([0,90]);



.. image:: multidimensional_coords_files/xarray_multidimensional_coords_12_0.png
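
The ``x='xc', y='yc'`` keywords used above do not depend on cartopy. As a
rough sketch, assuming only matplotlib is available, the difference between
plotting on logical and physical coordinates can be seen on a small synthetic
grid. The array ``field`` and its coordinate values are hypothetical, and the
``Agg`` backend is chosen only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import numpy as np
import xarray as xr
from matplotlib import pyplot as plt

# Hypothetical curvilinear grid: 2D longitude/latitude on logical dims (y, x).
ny, nx = 4, 5
j, i = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
lon = 190.0 + 2.0 * i + 0.5 * j
lat = 20.0 + 1.5 * j + 0.25 * i

field = xr.DataArray(
    np.arange(ny * nx, dtype=float).reshape(ny, nx),
    dims=("y", "x"),
    coords={"xc": (("y", "x"), lon), "yc": (("y", "x"), lat)},
    name="field",
)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))
# Default 2D plot: axes are the logical dimensions x and y.
logical = field.plot(ax=ax1)
# Same data drawn against the 2D physical coordinates.
physical = field.plot.pcolormesh(ax=ax2, x="xc", y="yc")
```

The left panel is a regular rectangle in index space, while the right panel
is sheared because each cell is drawn at its (made-up) longitude and latitude.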


Multidimensional Groupby
------------------------

The above example allowed us to visualize the data on a regular
latitude-longitude grid. But what if we want to do a calculation that
involves grouping over one of these physical coordinates (rather than
the logical coordinates), for example, calculating the mean temperature
at each latitude? This can be achieved using xarray's ``groupby``
method, which accepts multidimensional variables. By default,
``groupby`` will use every unique value in the variable, which is
probably not what we want. Instead, we can use the ``groupby_bins``
method to specify the output coordinates of the group.

.. code:: python

    # define two-degree wide latitude bins
    lat_bins = np.arange(0,91,2)
    # define a label for each bin corresponding to the central latitude
    lat_center = np.arange(1,90,2)
    # group by latitude (``yc``) according to those bins and take the mean
    Tair_lat_mean = ds.Tair.groupby_bins('yc', lat_bins, labels=lat_center).mean()
    # plot the result
    Tair_lat_mean.plot()




.. parsed-literal::

    [<matplotlib.lines.Line2D at 0x11cb92e90>]




.. image:: multidimensional_coords_files/xarray_multidimensional_coords_14_1.png


Note that the resulting coordinate for the ``groupby_bins`` operation
got the ``_bins`` suffix appended: ``yc_bins``. This helps us distinguish
it from the original multidimensional variable ``yc``.
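
To make the binning step easier to check by hand, here is a small sketch on
synthetic data. The array ``temp``, its values, and the latitude coordinate
are all hypothetical; only ``groupby_bins`` and its ``labels`` argument come
from the text above.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x2 grid whose 2D latitude coordinate "yc" varies mostly along y.
lat = np.array([[1.0, 1.5], [3.0, 3.5], [5.0, 5.5]])
temp = xr.DataArray(
    np.array([[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]),
    dims=("y", "x"),
    coords={"yc": (("y", "x"), lat)},
    name="temp",
)

lat_bins = [0, 2, 4, 6]   # two-degree wide bins
lat_center = [1, 3, 5]    # label each bin by its central latitude

# Each grid cell is assigned to a bin by its latitude, then averaged per bin.
zonal_mean = temp.groupby_bins("yc", lat_bins, labels=lat_center).mean()

print(zonal_mean.dims)    # the new dimension carries the "_bins" suffix
print(zonal_mean.values)  # per-bin means: 11, 21, 31
```

Each bin here collects exactly one row of the grid, so the result is simply
the row means. On a real curvilinear grid a bin would gather cells from many
different ``(y, x)`` positions.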
66 changes: 62 additions & 4 deletions doc/groupby.rst
@@ -14,10 +14,11 @@ __ http://www.jstatsoft.org/v40/i01/paper
- Combine your groups back into a single data object.

Group by operations work on both :py:class:`~xarray.Dataset` and
:py:class:`~xarray.DataArray` objects. Currently, you can only group by a single
one-dimensional variable (eventually, we hope to remove this limitation). Also,
note that for one-dimensional data, it is usually faster to rely on pandas'
implementation of the same pipeline.
:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although support for grouping
over a multi-dimensional variable has recently been implemented. Note that for
one-dimensional data, it is usually faster to rely on pandas' implementation of
the same pipeline.

Review comment (Member): Note that...

Split
~~~~~
@@ -63,6 +64,33 @@ You can also iterate over over groups in ``(label, group)`` pairs:
Just like in pandas, creating a GroupBy object is cheap: it does not actually
split the data until you access particular values.

Binning
~~~~~~~

Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
:py:meth:`~xarray.Dataset.groupby_bins` method.

.. ipython:: python

    x_bins = [0,25,50]
    ds.groupby_bins('x', x_bins).groups

The binning is implemented via `pandas.cut`__, whose documentation details how
the bins are assigned. As seen in the example above, by default, the bins are
labeled with strings using interval notation to precisely identify the bin
limits. To override this behavior, you can specify the bin labels explicitly.
Here we choose ``float`` labels which identify the bin centers:

.. ipython:: python

    x_bin_labels = [12.5,37.5]
    ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups

__ http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.cut.html
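
Since the bin assignment is delegated to ``pandas.cut``, its behavior can be
inspected directly. The following is a small sketch with made-up sample
values; it assumes the default ``right=True`` setting of ``pandas.cut``, under
which each bin includes its right edge.

```python
import pandas as pd

values = [5, 20, 30, 45]  # hypothetical sample values
x_bins = [0, 25, 50]

# Default labels: one interval per bin, e.g. (0, 25] and (25, 50].
default_labels = pd.cut(values, x_bins)
print(default_labels.categories)

# Explicit labels: here the bin centers.
center_labels = pd.cut(values, x_bins, labels=[12.5, 37.5])
print(list(center_labels))

# With right-closed bins, a value sitting exactly on an interior edge
# falls into the lower bin.
edge = pd.cut([25], x_bins, labels=[12.5, 37.5])
print(list(edge))
```

Values outside the outermost bin edges are not assigned to any bin, so the
choice of edges determines which data points take part in the grouping.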


Apply
~~~~~

@@ -149,3 +177,33 @@ guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray
:py:meth:`~xarray.DataArray.squeeze` methods.

.. _groupby.multidim:

Multidimensional Grouping
~~~~~~~~~~~~~~~~~~~~~~~~~

Many datasets have a multidimensional coordinate variable (e.g. longitude)
which is different from the logical grid dimensions (e.g. nx, ny). Such
variables are valid under the `CF conventions`__. Xarray supports groupby
operations over multidimensional coordinate variables:

__ http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables

.. ipython:: python

    da = xr.DataArray([[0,1],[2,3]],
        coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
                'lat': (['ny','nx'], [[10,10],[20,20]] ),},
        dims=['ny','nx'])
    da
    da.groupby('lon').sum()
    da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)

Because grouping over a multidimensional variable can generate a very large
number of groups, coarse binning via :py:meth:`~xarray.Dataset.groupby_bins`
may be desirable:

.. ipython:: python

    da.groupby_bins('lon', [0,45,50]).sum()
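
As a worked check of the two grouping modes, the sketch below rebuilds the
same small array and spells out the sums one would expect by hand. The
variable names ``by_value`` and ``by_bin`` are introduced here for
illustration only.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    [[0, 1], [2, 3]],
    coords={
        "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
        "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
    },
    dims=["ny", "nx"],
)

# One group per unique longitude: 30 -> 0, 40 -> 1 + 2, 50 -> 3.
by_value = da.groupby("lon").sum()
print(by_value.values)  # [0 3 3]

# Two coarse bins: (0, 45] -> 0 + 1 + 2, (45, 50] -> 3.
by_bin = da.groupby_bins("lon", [0, 45, 50]).sum()
print(by_bin.values)  # [3 3]
```

Grouping by value produces as many groups as there are distinct longitudes,
which grows quickly on real grids, whereas the binned version keeps the
number of groups fixed by the bin edges.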
6 changes: 6 additions & 0 deletions doc/whats-new.rst
@@ -30,6 +30,12 @@ Breaking changes
Enhancements
~~~~~~~~~~~~

- Groupby operations now support grouping over multidimensional variables. A new
method called :py:meth:`~xarray.Dataset.groupby_bins` has also been added to
allow users to specify bins for grouping. The new features are described in
:ref:`groupby.multidim` and :ref:`examples.multidim`.
By `Ryan Abernathey <http://github.com/rabernat>`_.

- DataArray and Dataset method :py:meth:`where` now supports a ``drop=True``
option that clips coordinate elements that are fully masked. By
`Phillip J. Wolfram <https://github.com/pwolfram>`_.