API: categorical grouping will no longer return the cartesian product (#20583)

* BUG: groupby with categorical and other columns

closes #14942
jreback authored and TomAugspurger committed May 1, 2018
1 parent 901fc64 commit b020891
Showing 15 changed files with 748 additions and 419 deletions.
74 changes: 51 additions & 23 deletions doc/source/groupby.rst
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:
- A Python function, to be called on each of the axis labels.
- A list or NumPy array of the same length as the selected axis.
- A dict or ``Series``, providing a ``label -> group name`` mapping.
- For ``DataFrame`` objects, a string indicating a column to be used to group.
Of course ``df.groupby('A')`` is just syntactic sugar for
``df.groupby(df['A'])``, but it makes life simpler.
- For ``DataFrame`` objects, a string indicating an index level to be used to
group.
- A list of any of the above things.
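The column-name shorthand mentioned above can be sketched with a toy frame (the column names here are invented for illustration): grouping by a label and grouping by the column itself give identical results.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x"], "vals": [1, 2, 3]})

# 'A' as a string is syntactic sugar for passing the column itself
by_name = df.groupby("A")["vals"].sum()
by_series = df.groupby(df["A"])["vals"].sum()

print(by_name.equals(by_series))
```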

@@ -120,7 +120,7 @@ consider the following ``DataFrame``:
'D' : np.random.randn(8)})
df
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
We could naturally group by either the ``A`` or ``B`` columns, or both:

.. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.
DataFrame column selection in GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have created the GroupBy object from a DataFrame, you might want to do
something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:

.. ipython:: python
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.
Selecting a group
-----------------

A single group can be selected using
:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:

.. ipython:: python
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the
:ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
and :ref:`resample API <timeseries.aggregate>`.

An obvious one is aggregation via the
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:

.. ipython:: python
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
:meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
:meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values


The aggregating functions above will exclude NA values. Any function which
reduces a :class:`Series` to a scalar value is an aggregation function and will work;
a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
filter, see :ref:`here <groupby.nth>`.
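A small sketch of that dual role (toy data; note that in recent pandas versions ``nth`` keeps the original row index rather than reindexing by the group key):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [None, 4.0, 5.0]})

# One row per group: the literal first row of each group, NA and all
first_rows = df.groupby("A").nth(0)

# As a filter: dropna='any' skips rows containing NA before taking the first
first_valid = df.groupby("A").nth(0, dropna="any")
```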

.. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.
.. note::

Some functions will automatically transform the input when applied to a
GroupBy object, returning an object of the same shape as the original.
Passing ``as_index=False`` will not affect these transformation methods.

For example: ``fillna``, ``ffill``, ``bfill``, ``shift``.
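For instance (toy data): ``ffill`` on a GroupBy returns an object of the same length as the input, filling forward only within each group.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1.0, None, None, 3.0]})

# Forward-fill within each group: group 2's leading NaN has nothing
# before it inside its own group, so it stays NaN.
filled = df.groupby("A")["B"].ffill()
print(filled.tolist())
```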
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:

In [11]: grouped.apply(f)

``apply`` on a Series can operate on a returned value from the applied function
that is itself a series, and possibly upcast the result to a DataFrame:

.. ipython:: python
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:
df.groupby('A').std()

Note that ``df.groupby('A').colname.std()`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
is only interesting over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.
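A quick sketch of the difference (column names invented): both expressions produce the same numbers, but the first computes the std for ``C`` only, while the second computes it for every numeric column before discarding all but one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["x", "y"] * 3,
                   "C": np.arange(6.0),
                   "D": np.arange(6.0) * 2})

narrow = df.groupby("A")["C"].std()   # aggregates one column
wide = df.groupby("A").std()["C"]     # aggregates all columns, then selects
```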

.. _groupby.observed:

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using a ``Categorical`` grouper (as a single grouper, or as part of multiple groupers), the ``observed`` keyword
controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
values that are observed (``observed=True``).

Show all values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()

Show only the observed values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()

The returned dtype of the grouped will *always* include *all* of the categories that were grouped.

.. ipython:: python

   s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
   s.index.dtype

.. _groupby.missing:

NA and NaT group handling
~~~~~~~~~~~~~~~~~~~~~~~~~

If there are any NaN or NaT values in the grouping key, these will be
automatically excluded. In other words, there will never be an "NA group" or
"NaT group". This was not the case in older versions of pandas, but users were
generally discarding the NA group anyway (and supporting it was an
implementation headache).
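A minimal sketch of that rule (toy data): the row whose key is NaN simply disappears from the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "val": [1, 2, 3, 4]})

# No "NaN group" is created; the third row is silently excluded.
result = df.groupby("key")["val"].sum()
print(result.index.tolist())
```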

Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.
Taking the nth row of each group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To select from a DataFrame or Series the nth item, use
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
will return a single row (or no row) per group if you pass an int for n:

.. ipython:: python
@@ -1153,7 +1181,7 @@ Enumerate groups
.. versionadded:: 0.20.2

To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use
:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
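A tiny example of the distinction (made-up frame): ``ngroup`` numbers the groups themselves, whereas ``cumcount`` numbers rows within each group.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b"]})

gid = df.groupby("A").ngroup()    # group id per row
pos = df.groupby("A").cumcount()  # position of the row within its group
print(gid.tolist(), pos.tolist())
```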


@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~

By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
information about the groups in a way similar to :func:`factorize` (as described
further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
naturally to multiple columns of mixed type and different
52 changes: 52 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our

.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/

.. _whatsnew_0230.enhancements.categorical_grouping:

Categorical Groupers have gained an observed keyword
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for
each grouper, not just the observed values. ``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward
compatible (generate a cartesian product). (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`)


.. ipython:: python

cat1 = pd.Categorical(["a", "a", "b", "b"],
categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df['C'] = ['foo', 'bar'] * 2
df

To show all values, the previous behavior:

.. ipython:: python

df.groupby(['A', 'B', 'C'], observed=False).count()


To show only observed values:

.. ipython:: python

df.groupby(['A', 'B', 'C'], observed=True).count()

For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:

.. ipython:: python

cat1 = pd.Categorical(["a", "a", "b", "b"],
categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df

.. ipython:: python

pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=True)
pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=False)


.. _whatsnew_0230.enhancements.other:

Other Enhancements
Expand Down
11 changes: 11 additions & 0 deletions pandas/conftest.py
@@ -66,6 +66,17 @@ def ip():
return InteractiveShell()


@pytest.fixture(params=[True, False, None])
def observed(request):
""" pass in the observed keyword to groupby for [True, False]
This indicates whether categoricals should return values for
values which are not in the grouper [False / None], or only values which
appear in the grouper [True]. [None] is supported for future compatiblity
if we decide to change the default (and would need to warn if this
parameter is not passed)"""
return request.param
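The behavior this fixture parametrizes can be sketched with a small helper (hypothetical, not part of the patch): ``observed=True`` keeps only categories that actually appear, ``observed=False`` keeps every category, including empty ones.

```python
import pandas as pd

def count_by_observed(observed):
    # Two data points, both in category 'a'; category 'b' is unused.
    cat = pd.Categorical(["a", "a"], categories=["a", "b"])
    return pd.Series([1, 2]).groupby(cat, observed=observed).count()
```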


@pytest.fixture(params=[None, 'gzip', 'bz2', 'zip',
pytest.param('xz', marks=td.skip_if_no_lzma)])
def compression(request):
31 changes: 29 additions & 2 deletions pandas/core/arrays/categorical.py
@@ -647,8 +647,13 @@ def _set_categories(self, categories, fastpath=False):

self._dtype = new_dtype

def _codes_for_groupby(self, sort):
def _codes_for_groupby(self, sort, observed):
"""
Code the categories to ensure we can groupby for categoricals.
If observed=True, we return a new Categorical with the observed
categories only.
If sort=False, return a copy of self, coded with categories as
returned by .unique(), followed by any categories not appearing in
the data. If sort=True, return self.
@@ -661,6 +666,8 @@ def _codes_for_groupby(self, sort):
----------
sort : boolean
The value of the sort parameter groupby was called with.
observed : boolean
Account only for the observed values
Returns
-------
@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
categories in the original order.
"""

# we only care about observed values
if observed:
unique_codes = unique1d(self.codes)
cat = self.copy()

take_codes = unique_codes[unique_codes != -1]
if self.ordered:
take_codes = np.sort(take_codes)

# we recode according to the uniques
categories = self.categories.take(take_codes)
codes = _recode_for_categories(self.codes,
self.categories,
categories)

# return a new categorical that maps our new codes
# and categories
dtype = CategoricalDtype(categories, ordered=self.ordered)
return type(self)(codes, dtype=dtype, fastpath=True)

# Already sorted according to self.categories; all is fine
if sort:
return self
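Using only public API, the ``observed=True`` branch above behaves much like dropping unused categories: unused categories are removed and the codes are recoded against the remaining ones (a rough sketch, not the actual code path).

```python
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["a", "b", "z"])

# 'z' never appears in the data, so it is dropped and the codes
# are recoded against the surviving categories ['a', 'b'].
observed_cat = cat.remove_unused_categories()
print(list(observed_cat.categories))
```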
@@ -2161,7 +2188,7 @@ def unique(self):
# exclude nan from indexer for categories
take_codes = unique_codes[unique_codes != -1]
if self.ordered:
take_codes = sorted(take_codes)
take_codes = np.sort(take_codes)
return cat.set_categories(cat.categories.take(take_codes))

def _values_for_factorize(self):
11 changes: 9 additions & 2 deletions pandas/core/generic.py
@@ -6599,7 +6599,7 @@ def clip_lower(self, threshold, axis=None, inplace=False):
axis=axis, inplace=inplace)

def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs):
group_keys=True, squeeze=False, observed=None, **kwargs):
"""
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns.
@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default None
if True: only show observed values for categorical groupers.
if False: show all values for categorical groupers.
if None: if any categorical groupers, show a FutureWarning,
default to False.
.. versionadded:: 0.23.0
Returns
-------
@@ -6665,7 +6672,7 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
axis = self._get_axis_number(axis)
return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
sort=sort, group_keys=group_keys, squeeze=squeeze,
**kwargs)
observed=observed, **kwargs)

def asfreq(self, freq, method=None, how=None, normalize=False,
fill_value=None):