API: categorical grouping will no longer return the cartesian product #20583
Changes from all commits: fa532b6, 144a63d, 19c9cf7, 7ae10ba, bdb7ad3, bdf7525
@@ -91,10 +91,10 @@ The mapping can be specified many different ways:

- A Python function, to be called on each of the axis labels.
- A list or NumPy array of the same length as the selected axis.
- A dict or ``Series``, providing a ``label -> group name`` mapping.
- For ``DataFrame`` objects, a string indicating a column to be used to group.
  Of course ``df.groupby('A')`` is just syntactic sugar for
  ``df.groupby(df['A'])``, but it makes life simpler.
- For ``DataFrame`` objects, a string indicating an index level to be used to
  group.
- A list of any of the above things.
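As a quick illustrative sketch (hypothetical data, not part of this diff), two of the mappings listed above, a column name and a dict keyed on the index, could look like:

    # Illustrative only -- hypothetical data, not taken from the PR.
    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'B': [1, 2, 3, 4]})

    # A string naming a column: sugar for df.groupby(df['A'])
    print(df.groupby('A')['B'].sum())

    # A dict providing a label -> group name mapping on the index
    mapping = {0: 'even', 1: 'odd', 2: 'even', 3: 'odd'}
    print(df.groupby(mapping)['B'].sum())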
@@ -120,7 +120,7 @@ consider the following ``DataFrame``:

       'D' : np.random.randn(8)})
   df

On a DataFrame, we obtain a GroupBy object by calling :meth:`~DataFrame.groupby`.
We could naturally group by either the ``A`` or ``B`` columns, or both:

.. ipython:: python
@@ -360,8 +360,8 @@ Index level names may be specified as keys directly to ``groupby``.

DataFrame column selection in GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have created the GroupBy object from a DataFrame, you might want to do
something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:

.. ipython:: python
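A minimal sketch of the column selection described in this hunk (hypothetical data, not part of the diff):

    # Selecting one column from a GroupBy with []
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'C': np.random.randn(4),
                       'D': np.random.randn(4)})

    grouped = df.groupby('A')
    grouped_C = grouped['C']      # a SeriesGroupBy for column 'C'
    print(grouped_C.sum())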
@@ -421,7 +421,7 @@ statement if you wish: ``for (k1, k2), group in grouped:``.

Selecting a group
-----------------

A single group can be selected using
:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:

.. ipython:: python
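A short sketch of ``get_group`` (hypothetical data, not part of the diff):

    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'B': ['one', 'one', 'two', 'two'],
                       'C': [1, 2, 3, 4]})

    # Select the rows belonging to one group
    print(df.groupby('A').get_group('bar'))

    # With multiple keys, pass a tuple of values
    print(df.groupby(['A', 'B']).get_group(('bar', 'one')))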
@@ -444,8 +444,8 @@ perform a computation on the grouped data. These operations are similar to the

:ref:`aggregating API <basics.aggregate>`, :ref:`window functions API <stats.aggregate>`,
and :ref:`resample API <timeseries.aggregate>`.

An obvious one is aggregation via the
:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently
:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method:

.. ipython:: python
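A brief sketch of aggregation via ``aggregate``/``agg`` (hypothetical data, not part of the diff):

    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'C': [1.0, 2.0, 3.0, 4.0],
                       'D': [10.0, 20.0, 30.0, 40.0]})

    grouped = df.groupby('A')
    print(grouped.aggregate('sum'))      # same as grouped.agg('sum')
    print(grouped.agg(['sum', 'mean']))  # several aggregations at once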
@@ -517,12 +517,12 @@ Some common aggregating functions are tabulated below:

   :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list
   :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values
   :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values

The aggregating functions above will exclude NA values. Any function which
reduces a :class:`Series` to a scalar value is an aggregation function and will work;
a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a
filter, see :ref:`here <groupby.nth>`.
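A small sketch of a custom reducer and of NA exclusion (hypothetical data, not part of the diff):

    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                       'B': [1.0, 2.0, None]})

    # NA values are excluded by the built-in aggregations
    print(df.groupby('A')['B'].sum())

    # Any Series -> scalar function works; here the peak-to-peak range per group
    print(df.groupby('A')['B'].agg(lambda ser: ser.max() - ser.min()))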
.. _groupby.aggregate.multifunc:
@@ -732,7 +732,7 @@ and that the transformed data contains no NAs.

.. note::

   Some functions, when applied to a GroupBy object, will automatically transform
   the input and return an object of the same shape as the original.
   Passing ``as_index=False`` will not affect these transformation methods.

   For example: ``fillna``, ``ffill``, ``bfill``, ``shift``.
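A quick sketch of one of these transformation methods (hypothetical data, not part of the diff):

    # ffill acts as a transformation, preserving the original shape
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                       'val': [1.0, np.nan, 2.0, np.nan]})

    # Forward-fill within each group; the result has one row per input row
    print(df.groupby('key')['val'].ffill())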
@@ -926,7 +926,7 @@ The dimension of the returned result can also change:

   In [11]: grouped.apply(f)

``apply`` on a Series can operate on a returned value from the applied function,
that is itself a series, and possibly upcast the result to a DataFrame:

.. ipython:: python
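A sketch of the upcasting described above (hypothetical data, not part of the diff): a function that returns a Series per element turns the result into a DataFrame.

    import pandas as pd

    s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

    def f(x):
        # return a Series for each element
        return pd.Series([x, x ** 2], index=['x', 'x^2'])

    print(s.apply(f))   # one row per element, columns 'x' and 'x^2'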
@@ -984,20 +984,48 @@ will be (silently) dropped. Thus, this does not pose any problems:

   df.groupby('A').std()

Note that ``df.groupby('A').colname.std()`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
is only interesting over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.
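A sketch of that efficiency point (hypothetical data, not part of the diff):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'C': np.random.randn(4),
                       'D': np.random.randn(4)})

    # Preferred: aggregate only the column of interest
    print(df.groupby('A')['C'].std())

    # Works, but computes the std of every column first
    print(df.groupby('A').std()['C'])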
.. _groupby.observed:

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword
controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those
that are observed groupers (``observed=True``).
Review comment: "or only those that are observed groupers" -> "or only the observed categories"
Show all values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
Review comment: I would maybe just create …
Show only the observed values:

.. ipython:: python

   pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()
The returned dtype of the grouped will *always* include *all* of the catergories that were grouped.
Review comment: catergories -> categories
.. ipython:: python

   s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
   s.index.dtype
.. _groupby.missing:

NA and NaT group handling
~~~~~~~~~~~~~~~~~~~~~~~~~

If there are any NaN or NaT values in the grouping key, these will be
automatically excluded. In other words, there will never be an "NA group" or
"NaT group". This was not the case in older versions of pandas, but users were
generally discarding the NA group anyway (and supporting it was an
implementation headache).
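A minimal sketch of that NA exclusion (hypothetical data, not part of the diff):

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 3])
    key = ['a', np.nan, 'a']

    # Only group 'a' appears; the NaN-keyed row is excluded
    print(s.groupby(key).sum())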
Grouping with ordered factors
@@ -1084,8 +1112,8 @@ This shows the first or last n rows from each group.

Taking the nth row of each group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To select from a DataFrame or Series the nth item, use
:meth:`~pd.core.groupby.DataFrameGroupBy.nth`. This is a reduction method, and
will return a single row (or no row) per group if you pass an int for n:

.. ipython:: python
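A brief sketch of ``nth`` as a reduction (hypothetical data, not part of the diff):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, 1, 2, 2],
                       'B': [np.nan, 4.0, 5.0, np.nan]})

    print(df.groupby('A').nth(0))    # first row of each group (NaNs included)
    print(df.groupby('A').nth(-1))   # last row of each group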
@@ -1153,7 +1181,7 @@ Enumerate groups

.. versionadded:: 0.20.2

To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use
:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
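A small sketch contrasting ``ngroup`` and ``cumcount`` (hypothetical data, not part of the diff):

    import pandas as pd

    df = pd.DataFrame({'A': list('aaabba')})

    print(df.groupby('A').ngroup())     # group number for each row
    print(df.groupby('A').cumcount())   # position of each row within its group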
@@ -1273,7 +1301,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on

Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~

By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
information about the groups in a way similar to :func:`factorize` (as described
further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
naturally to multiple columns of mixed type and different
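A quick sketch of ``ngroup`` used as a multi-column factorize (hypothetical data, not part of the diff):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                       'B': [1, 2, 1, 2]})

    # A single integer label per distinct (A, B) combination
    print(df.groupby(['A', 'B']).ngroup())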
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our

.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/

.. _whatsnew_0230.enhancements.categorical_grouping:

Categorical Groupers has gained an observed keyword
Review comment: has -> have? Because "categorical Groupers" is plural right?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for
Review comment: To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include unobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).

Reply: I didn't change this on purpose, this is more correct.

Reply: "Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as "Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."

Reply: Yep, the explanation of Tom is exactly what I meant. @jreback I have no problem at all with that you don't agree with a comment (it would be strange otherwise :-)) and thus not update for it, but can you then answer to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))
each grouper, not just the observed values. ``.groupby()`` has gained the ``observed`` keyword to toggle this behavior. The default remains backward
compatible (generate a cartesian product). (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`)
.. ipython:: python

   cat1 = pd.Categorical(["a", "a", "b", "b"],
                         categories=["a", "b", "z"], ordered=True)
   cat2 = pd.Categorical(["c", "d", "c", "d"],
                         categories=["c", "d", "y"], ordered=True)
   df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
   df['C'] = ['foo', 'bar'] * 2
   df
To show all values, the previous behavior:

.. ipython:: python

   df.groupby(['A', 'B', 'C'], observed=False).count()

To show only observed values:

.. ipython:: python

   df.groupby(['A', 'B', 'C'], observed=True).count()

For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:
.. ipython:: python

   cat1 = pd.Categorical(["a", "a", "b", "b"],
                         categories=["a", "b", "z"], ordered=True)
   cat2 = pd.Categorical(["c", "d", "c", "d"],
                         categories=["c", "d", "y"], ordered=True)
   df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
   df

.. ipython:: python

   pd.pivot_table(df, values='values', index=['A', 'B'],
                  dropna=True)
   pd.pivot_table(df, values='values', index=['A', 'B'],
                  dropna=False)
.. _whatsnew_0230.enhancements.other:

Other Enhancements
@@ -647,8 +647,13 @@ def _set_categories(self, categories, fastpath=False):

        self._dtype = new_dtype

    def _codes_for_groupby(self, sort):
    def _codes_for_groupby(self, sort, observed):
        """
        Code the categories to ensure we can groupby for categoricals.

        If observed=True, we return a new Categorical with the observed
        categories only.

        If sort=False, return a copy of self, coded with categories as
        returned by .unique(), followed by any categories not appearing in
        the data. If sort=True, return self.
@@ -661,6 +666,8 @@ def _codes_for_groupby(self, sort):

        ----------
        sort : boolean
            The value of the sort parameter groupby was called with.
        observed : boolean
            Account only for the observed values

        Returns
        -------
@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):

        categories in the original order.
Review comment: Can you update this docstring?
        """

        # we only care about observed values
        if observed:
            unique_codes = unique1d(self.codes)
Review comment: Haven't thought this through, but can this …

Reply: no, you actually need the uniques
            cat = self.copy()

            take_codes = unique_codes[unique_codes != -1]
            if self.ordered:
                take_codes = np.sort(take_codes)

            # we recode according to the uniques
            categories = self.categories.take(take_codes)
            codes = _recode_for_categories(self.codes,
                                           self.categories,
                                           categories)

            # return a new categorical that maps our new codes
            # and categories
            dtype = CategoricalDtype(categories, ordered=self.ordered)
            return type(self)(codes, dtype=dtype, fastpath=True)

        # Already sorted according to self.categories; all is fine
        if sort:
            return self
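For intuition, a rough public-API analogue of the observed-only recoding above (illustrative only; the PR itself uses internal helpers such as ``_recode_for_categories``) is ``remove_unused_categories``:

    import pandas as pd

    cat = pd.Categorical(['a', 'a'], categories=['a', 'b', 'z'], ordered=True)
    print(cat.codes)                  # codes against the full category set

    observed_only = cat.remove_unused_categories()
    print(observed_only.categories)   # only the observed category 'a'
    print(observed_only.codes)        # recoded against the reduced categories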
@@ -2161,7 +2188,7 @@ def unique(self):

        # exclude nan from indexer for categories
        take_codes = unique_codes[unique_codes != -1]
        if self.ordered:
            take_codes = sorted(take_codes)
            take_codes = np.sort(take_codes)
        return cat.set_categories(cat.categories.take(take_codes))

    def _values_for_factorize(self):
@@ -6599,7 +6599,7 @@ def clip_lower(self, threshold, axis=None, inplace=False):

                               axis=axis, inplace=inplace)

    def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
                group_keys=True, squeeze=False, **kwargs):
                group_keys=True, squeeze=False, observed=None, **kwargs):
        """
        Group series using mapper (dict or key function, apply given function
        to group, return result as series) or by a series of columns.
@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,

        squeeze : boolean, default False
            reduce the dimensionality of the return type if possible,
            otherwise return a consistent type
        observed : boolean, default None
            if True: only show observed values for categorical groupers.
Review comment: capital If (below as well)

Review comment: Also, can you start this explanation with noting this keyword is only when grouping by categorical values?
            if False: show all values for categorical groupers.
            if None: if any categorical groupers, show a FutureWarning,
            default to False.
Review comment: no indentation for rst formatting
            .. versionadded:: 0.23.0

        Returns
        -------
@@ -6665,7 +6672,7 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,

        axis = self._get_axis_number(axis)
        return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
                       sort=sort, group_keys=group_keys, squeeze=squeeze,
                       **kwargs)
                       observed=observed, **kwargs)

    def asfreq(self, freq, method=None, how=None, normalize=False,
               fill_value=None):
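A brief usage sketch of the keyword as wired through here (hypothetical data; behavior as described in the documentation added by this PR):

    import pandas as pd

    df = pd.DataFrame({'A': pd.Categorical(['a', 'a'], categories=['a', 'b']),
                       'values': [1, 2]})

    print(df.groupby('A', observed=False).count())   # includes unobserved 'b'
    print(df.groupby('A', observed=True).count())    # only the observed 'a'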
Review comment: we don't use "grouper" as terminology in our documentation (except for the pd.Grouper object), so I would write "groupby key" or "to group by". Also "multipler" -> "multiple".