API: categorical grouping will no longer return the cartesian product #20583

Merged
merged 6 commits into from
May 1, 2018
74 changes: 51 additions & 23 deletions doc/source/groupby.rst
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:
- A Python function, to be called on each of the axis labels.
- A list or NumPy array of the same length as the selected axis.
- A dict or ``Series``, providing a ``label -> group name`` mapping.
- For ``DataFrame`` objects, a string indicating a column to be used to group.
- For ``DataFrame`` objects, a string indicating a column to be used to group.
Of course ``df.groupby('A')`` is just syntactic sugar for
``df.groupby(df['A'])``, but it makes life simpler.
- For ``DataFrame`` objects, a string indicating an index level to be used to
- For ``DataFrame`` objects, a string indicating an index level to be used to
group.
- A list of any of the above things.

@@ -120,7 +120,7 @@ consider the following ``DataFrame``:
'D' : np.random.randn(8)})
df

On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
We could naturally group by either the ``A`` or ``B`` columns, or both:

.. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.
DataFrame column selection in GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have created the GroupBy object from a DataFrame, you might want to do
something different for each of the columns. Thus, using ``[]`` similar to
Once you have created the GroupBy object from a DataFrame, you might want to do
something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:

.. ipython:: python
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.
Selecting a group
-----------------

A single group can be selected using
A single group can be selected using
:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:

.. ipython:: python
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the
:ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
and :ref:`resample API <timeseries.aggregate>`.

An obvious one is aggregation via the
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
An obvious one is aggregation via the
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:

.. ipython:: python
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
:meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
:meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values


The aggregating functions above will exclude NA values. Any function which

The aggregating functions above will exclude NA values. Any function which
reduces a :class:`Series` to a scalar value is an aggregation function and will work,
a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
filter, see :ref:`here <groupby.nth>`.

.. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.
.. note::

Some functions will automatically transform the input when applied to a
GroupBy object, but returning an object of the same shape as the original.
GroupBy object, but returning an object of the same shape as the original.
Passing ``as_index=False`` will not affect these transformation methods.

For example: ``fillna, ffill, bfill, shift.``.
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:

In [11]: grouped.apply(f)

``apply`` on a Series can operate on a returned value from the applied function,
``apply`` on a Series can operate on a returned value from the applied function,
that is itself a series, and possibly upcast the result to a DataFrame:

.. ipython:: python
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:

df.groupby('A').std()

Note that ``df.groupby('A').colname.std()`` is more efficient than
Note that ``df.groupby('A').colname.std()`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
is only interesting over one column (here ``colname``), it may be filtered
is only interesting over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.

.. _groupby.observed:

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword
Member

we don't use "grouper" as terminology in our documentation (except for the pd.Grouper object), so I would write "groupby key" or "to group by"

also "multipler" -> "multiple"

controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those
that are observed groupers (``observed=True``).
Member

"or only those that are observed groupers" -> "or only the observed categories"


Show all values:

.. ipython:: python

pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
Member

I would maybe just create s and cat to avoid repeating this a few times
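
A minimal sketch of what this suggestion could look like, assuming the names `s` and `cat` from the comment (they are not defined in the diff at this point) and the `observed` keyword this PR adds:

```python
import pandas as pd

# Hypothetical names taken from the review comment: build the inputs once.
cat = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])
s = pd.Series([1, 1, 1])

# observed=False keeps the unobserved category 'b', with a count of 0.
all_counts = s.groupby(cat, observed=False).count()

# observed=True keeps only the categories that occur in the data.
observed_counts = s.groupby(cat, observed=True).count()

print(all_counts)
print(observed_counts)
```

Defining the two inputs once keeps each later example down to a single short `groupby` call.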


Show only the observed values:

.. ipython:: python

pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()

The returned dtype of the grouped will *always* include *all* of the catergories that were grouped.
Member

catergories -> categories


.. ipython:: python

s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
s.index.dtype

.. _groupby.missing:

NA and NaT group handling
~~~~~~~~~~~~~~~~~~~~~~~~~

If there are any NaN or NaT values in the grouping key, these will be
automatically excluded. In other words, there will never be an "NA group" or
"NaT group". This was not the case in older versions of pandas, but users were
generally discarding the NA group anyway (and supporting it was an
If there are any NaN or NaT values in the grouping key, these will be
automatically excluded. In other words, there will never be an "NA group" or
"NaT group". This was not the case in older versions of pandas, but users were
generally discarding the NA group anyway (and supporting it was an
implementation headache).

Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.
Taking the nth row of each group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To select from a DataFrame or Series the nth item, use
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
To select from a DataFrame or Series the nth item, use
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
will return a single row (or no row) per group if you pass an int for n:

.. ipython:: python
@@ -1153,7 +1181,7 @@ Enumerate groups
.. versionadded:: 0.20.2

To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use
within a group given by ``cumcount``) you can use
:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.


@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~

By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
information about the groups in a way similar to :func:`factorize` (as described
further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
naturally to multiple columns of mixed type and different
52 changes: 52 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our

.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/

.. _whatsnew_0230.enhancements.categorical_grouping:

Categorical Groupers has gained an observed keyword
Contributor

has -> have? Because "categorical Groupers" is plural right?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for
Member

To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include unobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).

Contributor Author

I didn't change this on purpose, this is more correct.

Contributor
@TomAugspurger May 1, 2018

"Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as

"Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."
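
To make the proposed wording concrete, here is a small sketch in which two categorical keys with three categories each are grouped. The names `full` and `seen` are illustrative and not part of the PR. Only four combinations occur in the data, yet the result that includes unobserved categories has one row per pair of categories:

```python
import pandas as pd

cat1 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"])
cat2 = pd.Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"])
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# Unobserved categories included: one row per (A, B) pair of categories.
full = df.groupby(["A", "B"], observed=False).count()

# Observed combinations only: one row per pair that occurs in the data.
seen = df.groupby(["A", "B"], observed=True).count()

print(len(full), len(seen))
```

With more keys or more categories, the first result grows multiplicatively, which is the memory concern raised in the comment.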

Member

Yep, the explanation of Tom is exactly what I meant.

@jreback I have no problem at all with that you don't agree with a comment (it would be strange otherwise :-)) and thus not update for it, but can you then answer to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))

each grouper, not just the observed values. ``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward
compatible (generate a cartesian product). (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`)


.. ipython:: python

cat1 = pd.Categorical(["a", "a", "b", "b"],
categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df['C'] = ['foo', 'bar'] * 2
df

To show all values, the previous behavior:

.. ipython:: python

df.groupby(['A', 'B', 'C'], observed=False).count()


To show only observed values:

.. ipython:: python

df.groupby(['A', 'B', 'C'], observed=True).count()

For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:

.. ipython:: python

cat1 = pd.Categorical(["a", "a", "b", "b"],
categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df

.. ipython:: python

pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=True)
pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=False)


.. _whatsnew_0230.enhancements.other:

Other Enhancements
11 changes: 11 additions & 0 deletions pandas/conftest.py
@@ -66,6 +66,17 @@ def ip():
return InteractiveShell()


@pytest.fixture(params=[True, False, None])
def observed(request):
""" pass in the observed keyword to groupby for [True, False]
This indicates whether categoricals should return values for
values which are not in the grouper [False / None], or only values which
appear in the grouper [True]. [None] is supported for future compatibility
if we decide to change the default (and would need to warn if this
parameter is not passed)"""
return request.param


@pytest.fixture(params=[None, 'gzip', 'bz2', 'zip',
pytest.param('xz', marks=td.skip_if_no_lzma)])
def compression(request):
31 changes: 29 additions & 2 deletions pandas/core/arrays/categorical.py
@@ -647,8 +647,13 @@ def _set_categories(self, categories, fastpath=False):

self._dtype = new_dtype

def _codes_for_groupby(self, sort):
def _codes_for_groupby(self, sort, observed):
"""
Code the categories to ensure we can groupby for categoricals.

If observed=True, we return a new Categorical with the observed
categories only.

If sort=False, return a copy of self, coded with categories as
returned by .unique(), followed by any categories not appearing in
the data. If sort=True, return self.
@@ -661,6 +666,8 @@ def _codes_for_groupby(self, sort):
----------
sort : boolean
The value of the sort parameter groupby was called with.
observed : boolean
Account only for the observed values

Returns
-------
@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
categories in the original order.
Member

Can you update this docstring?

"""

# we only care about observed values
if observed:
unique_codes = unique1d(self.codes)
Contributor

Haven't thought this through, but can this if block be replaced with self.remove_unused_categories()._codes_for_groupby(sort=sort, observed=False)?

Contributor Author

no, you actually need the uniques

cat = self.copy()

take_codes = unique_codes[unique_codes != -1]
if self.ordered:
take_codes = np.sort(take_codes)

# we recode according to the uniques
categories = self.categories.take(take_codes)
codes = _recode_for_categories(self.codes,
self.categories,
categories)

# return a new categorical that maps our new codes
# and categories
dtype = CategoricalDtype(categories, ordered=self.ordered)
return type(self)(codes, dtype=dtype, fastpath=True)

# Already sorted according to self.categories; all is fine
if sort:
return self
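
The recoding in the ``observed`` branch above can be approximated with public API only. This is a sketch rather than the PR's implementation: the private helpers `unique1d` and `_recode_for_categories` are swapped for `pd.unique` and `Categorical.set_categories`, and the sample data is made up. It also shows one visible difference from `remove_unused_categories()`, which may be what the reply about needing the uniques refers to: for an unordered categorical, the unique-based approach orders the kept categories by first appearance.

```python
import numpy as np
import pandas as pd

# Illustrative data: 'z' is never observed and None is a missing value.
cat = pd.Categorical(["b", "b", "a", None], categories=["a", "b", "z"])

# Codes in order of first appearance; -1 marks missing values.
unique_codes = pd.unique(cat.codes)
take_codes = unique_codes[unique_codes != -1]
if cat.ordered:
    # Ordered categoricals keep their category order.
    take_codes = np.sort(take_codes)

# Keep only the observed categories and recode the values against them.
observed_categories = cat.categories.take(take_codes)
recoded = cat.set_categories(observed_categories)

print(list(observed_categories))
print(list(cat.remove_unused_categories().categories))
print(recoded.codes.tolist())
```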
@@ -2161,7 +2188,7 @@ def unique(self):
# exclude nan from indexer for categories
take_codes = unique_codes[unique_codes != -1]
if self.ordered:
take_codes = sorted(take_codes)
take_codes = np.sort(take_codes)
return cat.set_categories(cat.categories.take(take_codes))

def _values_for_factorize(self):
11 changes: 9 additions & 2 deletions pandas/core/generic.py
@@ -6599,7 +6599,7 @@ def clip_lower(self, threshold, axis=None, inplace=False):
axis=axis, inplace=inplace)

def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs):
group_keys=True, squeeze=False, observed=None, **kwargs):
"""
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns.
@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default None
if True: only show observed values for categorical groupers.
Member

capital If (below as well)

Member

Also, can you start this explanation with noting this keyword is only when grouping by categorical values?

if False: show all values for categorical groupers.
if None: if any categorical groupers, show a FutureWarning,
default to False.
Member

no indentation for rst formatting


.. versionadded:: 0.23.0

Returns
-------
@@ -6665,7 +6672,7 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
axis = self._get_axis_number(axis)
return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
sort=sort, group_keys=group_keys, squeeze=squeeze,
**kwargs)
observed=observed, **kwargs)

def asfreq(self, freq, method=None, how=None, normalize=False,
fill_value=None):