API: categorical grouping will no longer return the cartesian product (#20583)

* BUG: groupby with categorical and other columns

closes #14942
jreback authored and TomAugspurger committed May 1, 2018
1 parent 901fc64 commit b020891
Showing 15 changed files with 748 additions and 419 deletions.
74 changes: 51 additions & 23 deletions doc/source/groupby.rst
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:
- A Python function, to be called on each of the axis labels.
- A list or NumPy array of the same length as the selected axis.
- A dict or ``Series``, providing a ``label -> group name`` mapping.
- For ``DataFrame`` objects, a string indicating a column to be used to group.
Of course ``df.groupby('A')`` is just syntactic sugar for
``df.groupby(df['A'])``, but it makes life simpler.
- For ``DataFrame`` objects, a string indicating an index level to be used to
group.
- A list of any of the above things.
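The column-name shorthand mentioned above can be sketched with a toy frame (the column names here are invented for illustration): grouping by a label and grouping by the column itself give identical results.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x"], "vals": [1, 2, 3]})

# 'A' as a string is syntactic sugar for passing the column itself
by_name = df.groupby("A")["vals"].sum()
by_series = df.groupby(df["A"])["vals"].sum()

print(by_name.equals(by_series))
```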

@@ -120,7 +120,7 @@ consider the following ``DataFrame``:
'D' : np.random.randn(8)})
df
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
We could naturally group by either the ``A`` or ``B`` columns, or both:

.. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.
DataFrame column selection in GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have created the GroupBy object from a DataFrame, you might want to do
something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:

.. ipython:: python
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.
Selecting a group
-----------------

A single group can be selected using
:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:

.. ipython:: python
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the
:ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
and :ref:`resample API <timeseries.aggregate>`.

An obvious one is aggregation via the
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:

.. ipython:: python
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
:meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
:meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values


The aggregating functions above will exclude NA values. Any function which
reduces a :class:`Series` to a scalar value is an aggregation function and will work;
a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
filter, see :ref:`here <groupby.nth>`.
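A small sketch of that dual role (toy data; note that in recent pandas versions ``nth`` keeps the original row index rather than reindexing by the group key):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [None, 4.0, 5.0]})

# One row per group: the literal first row of each group, NA and all
first_rows = df.groupby("A").nth(0)

# As a filter: dropna='any' skips rows containing NA before taking the first
first_valid = df.groupby("A").nth(0, dropna="any")
```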

.. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.
.. note::

Some functions will automatically transform the input when applied to a
GroupBy object, returning an object of the same shape as the original.
Passing ``as_index=False`` will not affect these transformation methods.

For example: ``fillna``, ``ffill``, ``bfill``, ``shift``.
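For instance (toy data): ``ffill`` on a GroupBy returns an object of the same length as the input, filling forward only within each group.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1.0, None, None, 3.0]})

# Forward-fill within each group: group 2's leading NaN has nothing
# before it inside its own group, so it stays NaN.
filled = df.groupby("A")["B"].ffill()
print(filled.tolist())
```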
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:

In [11]: grouped.apply(f)

``apply`` on a Series can operate on a returned value from the applied function
that is itself a series, and possibly upcast the result to a DataFrame:

.. ipython:: python
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:
df.groupby('A').std()

Note that ``df.groupby('A').colname.std()`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
is only interesting over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.
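A quick sketch of the difference (column names invented): both expressions produce the same numbers, but the first computes the std for ``C`` only, while the second computes it for every numeric column before discarding all but one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["x", "y"] * 3,
                   "C": np.arange(6.0),
                   "D": np.arange(6.0) * 2})

narrow = df.groupby("A")["C"].std()   # aggregates one column
wide = df.groupby("A").std()["C"]     # aggregates all columns, then selects
```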

.. _groupby.observed:

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using a ``Categorical`` grouper (as a single grouper, or as part of multiple groupers), the ``observed`` keyword
controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
values that are observed (``observed=True``).

Show all values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()

Show only the observed values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()

The returned dtype of the grouped will *always* include *all* of the categories that were grouped.

.. ipython:: python

   s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
   s.index.dtype

.. _groupby.missing:

NA and NaT group handling
~~~~~~~~~~~~~~~~~~~~~~~~~

If there are any NaN or NaT values in the grouping key, these will be
automatically excluded. In other words, there will never be an "NA group" or
"NaT group". This was not the case in older versions of pandas, but users were
generally discarding the NA group anyway (and supporting it was an
implementation headache).
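A minimal sketch of that rule (toy data): the row whose key is NaN simply disappears from the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "val": [1, 2, 3, 4]})

# No "NaN group" is created; the third row is silently excluded.
result = df.groupby("key")["val"].sum()
print(result.index.tolist())
```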

Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.
Taking the nth row of each group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To select from a DataFrame or Series the nth item, use
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
will return a single row (or no row) per group if you pass an int for n:

.. ipython:: python
@@ -1153,7 +1181,7 @@ Enumerate groups
.. versionadded:: 0.20.2

To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use
:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
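A tiny example of the distinction (made-up frame): ``ngroup`` numbers the groups themselves, whereas ``cumcount`` numbers rows within each group.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b"]})

gid = df.groupby("A").ngroup()    # group id per row
pos = df.groupby("A").cumcount()  # position of the row within its group
print(gid.tolist(), pos.tolist())
```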


@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~

By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
information about the groups in a way similar to :func:`factorize` (as described
further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
naturally to multiple columns of mixed type and different
52 changes: 52 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our

.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/

.. _whatsnew_0230.enhancements.categorical_grouping:

Categorical Groupers have gained an observed keyword
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for
each grouper, not just the observed values. ``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward
compatible (generate a cartesian product). (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`)


.. ipython:: python

cat1 = pd.Categorical(["a", "a", "b", "b"],
categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df['C'] = ['foo', 'bar'] * 2
df

To show all values, the previous behavior:

.. ipython:: python

df.groupby(['A', 'B', 'C'], observed=False).count()


To show only observed values:

.. ipython:: python

df.groupby(['A', 'B', 'C'], observed=True).count()

For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:

.. ipython:: python

cat1 = pd.Categorical(["a", "a", "b", "b"],
categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df

.. ipython:: python

pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=True)
pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=False)


.. _whatsnew_0230.enhancements.other:

Other Enhancements
Expand Down
11 changes: 11 additions & 0 deletions pandas/conftest.py
@@ -66,6 +66,17 @@ def ip():
return InteractiveShell()


@pytest.fixture(params=[True, False, None])
def observed(request):
""" pass in the observed keyword to groupby for [True, False]
This indicates whether categoricals should return values for
values which are not in the grouper [False / None], or only values which
appear in the grouper [True]. [None] is supported for future compatiblity
if we decide to change the default (and would need to warn if this
parameter is not passed)"""
return request.param
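The behavior this fixture parametrizes can be sketched with a small helper (hypothetical, not part of the patch): ``observed=True`` keeps only categories that actually appear, ``observed=False`` keeps every category, including empty ones.

```python
import pandas as pd

def count_by_observed(observed):
    # Two data points, both in category 'a'; category 'b' is unused.
    cat = pd.Categorical(["a", "a"], categories=["a", "b"])
    return pd.Series([1, 2]).groupby(cat, observed=observed).count()
```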


@pytest.fixture(params=[None, 'gzip', 'bz2', 'zip',
pytest.param('xz', marks=td.skip_if_no_lzma)])
def compression(request):
31 changes: 29 additions & 2 deletions pandas/core/arrays/categorical.py
@@ -647,8 +647,13 @@ def _set_categories(self, categories, fastpath=False):

self._dtype = new_dtype

def _codes_for_groupby(self, sort):
def _codes_for_groupby(self, sort, observed):
"""
Code the categories to ensure we can groupby for categoricals.
If observed=True, we return a new Categorical with the observed
categories only.
If sort=False, return a copy of self, coded with categories as
returned by .unique(), followed by any categories not appearing in
the data. If sort=True, return self.
@@ -661,6 +666,8 @@ def _codes_for_groupby(self, sort):
----------
sort : boolean
The value of the sort parameter groupby was called with.
observed : boolean
Account only for the observed values
Returns
-------
@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
categories in the original order.
"""

# we only care about observed values
if observed:
unique_codes = unique1d(self.codes)
cat = self.copy()

take_codes = unique_codes[unique_codes != -1]
if self.ordered:
take_codes = np.sort(take_codes)

# we recode according to the uniques
categories = self.categories.take(take_codes)
codes = _recode_for_categories(self.codes,
self.categories,
categories)

# return a new categorical that maps our new codes
# and categories
dtype = CategoricalDtype(categories, ordered=self.ordered)
return type(self)(codes, dtype=dtype, fastpath=True)

# Already sorted according to self.categories; all is fine
if sort:
return self
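Using only public API, the ``observed=True`` branch above behaves much like dropping unused categories: unused categories are removed and the codes are recoded against the remaining ones (a rough sketch, not the actual code path).

```python
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["a", "b", "z"])

# 'z' never appears in the data, so it is dropped and the codes
# are recoded against the surviving categories ['a', 'b'].
observed_cat = cat.remove_unused_categories()
print(list(observed_cat.categories))
```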
@@ -2161,7 +2188,7 @@ def unique(self):
# exclude nan from indexer for categories
take_codes = unique_codes[unique_codes != -1]
if self.ordered:
take_codes = sorted(take_codes)
take_codes = np.sort(take_codes)
return cat.set_categories(cat.categories.take(take_codes))

def _values_for_factorize(self):
11 changes: 9 additions & 2 deletions pandas/core/generic.py
@@ -6599,7 +6599,7 @@ def clip_lower(self, threshold, axis=None, inplace=False):
axis=axis, inplace=inplace)

def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs):
group_keys=True, squeeze=False, observed=None, **kwargs):
"""
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns.
@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default None
if True: only show observed values for categorical groupers.
if False: show all values for categorical groupers.
if None: if any categorical groupers, show a FutureWarning,
default to False.
.. versionadded:: 0.23.0
Returns
-------
@@ -6665,7 +6672,7 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
axis = self._get_axis_number(axis)
return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
sort=sort, group_keys=group_keys, squeeze=squeeze,
**kwargs)
observed=observed, **kwargs)

def asfreq(self, freq, method=None, how=None, normalize=False,
fill_value=None):