Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: categorical grouping will no longer return the cartesian product #20583

Merged
merged 6 commits into from
May 1, 2018

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Apr 2, 2018

closes #14942
closes #15217
closes #17594
closes #8869

xref #8138

@jreback jreback added Groupby API Design Categorical Categorical Data Type labels Apr 2, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 2, 2018
@jreback
Copy link
Contributor Author

jreback commented Apr 2, 2018

this makes categorical groupers work like other groupers and should make things more performant and intuitive. It is somewhat walking back a change from when we first introduced categorical groupers. But the information is preserved (meaning the categories are still there), just removes the automatic re-indexing (which was causing memory to blow up).

still need some increased test coverage. note the first commit actually has almost all of the changes, the next are just cleaning up tests.

@TomAugspurger
Copy link
Contributor

Just making sure, this affects grouping by a single categorical as well?

@jreback
Copy link
Contributor Author

jreback commented Apr 2, 2018

yes this affects just a single categorical column also

@codecov
Copy link

codecov bot commented Apr 2, 2018

Codecov Report

Merging #20583 into master will increase coverage by <.01%.
The diff coverage is 98.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20583      +/-   ##
==========================================
+ Coverage   91.78%   91.79%   +<.01%     
==========================================
  Files         153      153              
  Lines       49341    49371      +30     
==========================================
+ Hits        45287    45319      +32     
+ Misses       4054     4052       -2
Flag Coverage Δ
#multiple 90.18% <98.07%> (+0.01%) ⬆️
#single 41.92% <5.76%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.67% <100%> (+0.05%) ⬆️
pandas/core/groupby/groupby.py 92.62% <100%> (+0.07%) ⬆️
pandas/core/indexes/category.py 97.03% <100%> (ø) ⬆️
pandas/core/reshape/pivot.py 96.97% <87.5%> (ø) ⬆️
pandas/util/testing.py 84.59% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28edd06...bdf7525. Read the comment docs.

@jreback jreback force-pushed the cats branch 2 times, most recently from d53e6f6 to 582da12 Compare April 9, 2018 15:05
@jreback jreback changed the title WIP/API: categorical grouping will no longer return the cartesian product API: categorical grouping will no longer return the cartesian product Apr 9, 2018
@jreback
Copy link
Contributor Author

jreback commented Apr 9, 2018

so this is all ready to go if any comments. @TomAugspurger @jorisvandenbossche

@jorisvandenbossche
Copy link
Member

yes this affects just a single categorical column also

That doesn't seem to be the case?

With this branch, and the example of the whatsnew docs:

In [14]: df.groupby(['A', 'B']).sum()
Out[14]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4

In [15]: df.groupby('A').sum()
Out[15]: 
   values
A        
a       3
b       7
z       0               <---------- now unobserved category is included

Further a general comment (will try to do more detailed review later this week): I am not sure we can just change this. First, it is a API breaking change in several places, eg also pivot_table*. And second, I think the current behaviour can actually useful in certain cases and it would be nice to have a way to keep this behaviour.
I know this is very ugly, but it would be worth to have a keyword for this in groupby ?

* but I agree we should look into it and try to make this more consistent. As, for example, pivot does not seem to include unobserved categories (already currently on master), while pivot_table does include them but apparently only for the index and not columns? (try df.pivot_table('values', 'A', 'B') with the example of whatsnew). On the other hand, value_counts does include them (and I think rightly so), but that also introduces an inconsistency with groupby.

@jreback
Copy link
Contributor Author

jreback commented Apr 12, 2018

So I could also change this for a single grouper. That breaks a couple of tests. I am inclined to do this actually as then it makes the multi and single case consistent.

We cannot support a full cartesian product for multi-groupers as this will blow up memory and kill performance. So either we go for:

  • an option to turn this on/off.
  • differing behavior for single vs multi
  • change single grouping behavior as well

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 12, 2018 via email



@pytest.mark.xfail(reason="failing with observed")
def test_observed_failing(observed):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if anyone wants to take a crack at this test. Its the only one that uses an IntervalIndex as its category, though this might be a red herring.

cc @WillAyd @toobaz @jschendel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't been able to follow all the way through yet but I think this is a regression with the below method:

def decons_group_index(comp_labels, shape):

Using the failing test example this returns [array(1, 1, 0, 0], ... but on master that same object would return [array(1, 1, 2, 2], .... As a result, I think the reconstructed group labels are getting swapped in _wrap_agged_blocks and causing the failure.

Will try to find time next day or so to walk through in more detail but sharing in case it helps anyone else reviewing the issue

@jreback jreback force-pushed the cats branch 2 times, most recently from 422a6be to 7cd56cd Compare April 26, 2018 00:33
@jreback
Copy link
Contributor Author

jreback commented Apr 26, 2018

this is read for a look

@TomAugspurger @jorisvandenbossche

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Main thing is the change to pivot_table.

It'd be good to document whether the unobserved categories are present in the resulting index's type.

i.e. is

In [7]: pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])).count().index.dtype
Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)

or

Out[7]: CategoricalDtype(categories=['a'], ordered=False)

This could go in the whatsnew and either groupby.rst or categorical.rst probably. I think either behavior is fine.

Still going through the test changes.


.. ipython:: python

.. code-block:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe have an example showing the future warning? So just this without observed=False, and an :okwarning: directive. Then you can say "use observed=False to retain the previous behavior and silence the warning.

df.groupby(['A', 'B', 'C'], observed=False).count()


New Behavior (show only observed values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New -> Future?

msg = ("pass observed=True to ensure that a "
"categorical grouper only returns the "
"observed groupers, or\n"
"observed=False to return NA for non-observed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things like 'count' return 0 for unobserved. You could rephrase as "observed=False to include unobserved categories."

@@ -79,7 +79,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
pass
values = list(values)

grouped = data.groupby(keys)
grouped = data.groupby(keys, observed=dropna)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to think about this a bit. Are we all OK with "overloading" dropna to serve two purposes? I think it's ok...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but that's the meaning of the dropna now here anyhow

@@ -241,10 +241,13 @@ def _all_key(key):
return (key, margins_name) + ('',) * (len(cols) - 1)

if len(rows) > 0:
margin = data[rows + values].groupby(rows).agg(aggfunc)
margin = data[rows + values].groupby(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observed=True for these changes are all backwards compatible?

@@ -488,12 +488,12 @@ def test_agg_structs_series(structure, expected):


@pytest.mark.xfail(reason="GH-18869: agg func not called on empty groups.")
def test_agg_category_nansum():
def test_agg_category_nansum(observed):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be xfailed for observed=True? I think it may be the right answer (not sure though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed so that observed=True is XPASS (not sure how to xfail on a particular fixture value)

@@ -4,6 +4,11 @@
from pandas.util import testing as tm


@pytest.fixture(params=[True, False])
def observed(request):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring would be nice.

names=['A', 'B'])
expected = DataFrame(
{'values': [1, 2, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan]},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so those changes in pivot are API breaking? I'd prefer that this goes through the same deprecation cycle. On the other hand, this would mean an additional keyword, so that we can do the deprecation cycle properly...

Do we think pivot(..., dropna=True, observed=False) (i.e. the current default) is a useful combination? I could see it being desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observed works like dropna, so I am not sure we need this. I can make the default work like the existing I think (which is effectively dropna=False)

def test_groupby_sort_categorical():
def test_sort():

# http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line too long? https://stackoverflow.com/q/23814368/1889400 is a short link to the same q.

@TomAugspurger
Copy link
Contributor

http://pandas-docs.github.io/pandas-docs-travis/categorical.html#operations would be a good place to show observed=False.

About pivot_table, I always forget what dropna does... Does it control dropping columns / rows that are all NA before or after aggregating?

@jorisvandenbossche
Copy link
Member

I will try to give this a more detailed look tomorrow.

But general comment: now that we have the keyword to specify the behaviour (for now with observed=False for back compat), I am not really sure that we should change the default in the future.

It's difficult to judge, as I personally don't run much in such situations, but my gut feeling says that the original issue that motivated the change (the combinatorial explosion with multiple categorical groupers) is not the majority usage pattern of categoricals. And if you don't want to include unobserved categoricals in your analyses in general (groupby, pivot, value_counts, plotting, ..), you always have the easy functionality of remove_unused_categories.

cc @jankatins

@jreback
Copy link
Contributor Author

jreback commented Apr 26, 2018

@TomAugspurger the pivot was a red herring, was not passing things thru. and dropna (in pivot) is exactly equivalent of the observed kwarg (to groupby). So maybe should just rename observed -> dropna, and would be consistent across other functions (value_counts) as well.

@jorisvandenbossche I disagree. You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default. (its also easy to create if you need). Not this has nothing to do with the actual categories that are returned, they are in BOTH cases indicated on the dtype of the level of that index, (observed or not), its the groupers that are the issue.

@toobaz
Copy link
Member

toobaz commented Apr 26, 2018

@jorisvandenbossche I also think the default should change. Categoricals are mostly an implementation detail, and should behave as similarly as possible as ordinary Series.

@jorisvandenbossche
Copy link
Member

You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default.

But in most cases you have no cartesian product, you only have a single categorical key

It can easily blow you up and shouldn't be the default. (its also easy to create if you need).

It's also easy to get the version with only the observed ones (certainly now there is the keyword), so that is not really an argument IMO

Not this has nothing to do with the actual categories that are returned, they are in BOTH cases indicated on the dtype of the level of that index

Yes, and that is certainly a good thing. But for me it is still the question what we want the visible output to be.
For example a pivot table is often used as a kind of a summary table. In such a case, I often do want to know that a certain category I care about (otherwise it would not be in the categories) has no values.
And it is also about consistency: value_counts does include the unobserved one (and rightly so, IMO).

@jankatins
Copy link
Contributor

My original motivation to work on categoricals were stuff like lickert scales ("completly agree ... completly disagree", 5 to 7 values). For that seing unobserved categoricals in group bys are a good thing.

Since then a lot of changes have been made to make categoricals usefull in other situations (like as a memory efficient string replacement or to analyse genes?).

This change feels like it is a change to make working with the latter easier but will make the former harder.

@toobaz
Copy link
Member

toobaz commented Apr 27, 2018

And it is also about consistency: value_counts does include the unobserved one (and rightly so, IMO).

I gave it as granted that if we change groupby we will also change value_counts - otherwise I agree it doesn't make any sense.

This change feels like it is a change to make working with the latter easier but will make the former harder.

True... I just think your use case is more rare (but I have no data to support this statement, other than personal experience). Anyway, with the new argument both things will be pretty easy anyway, it is mainly a matter of which use case should require awareness from the user.

@jreback
Copy link
Contributor Author

jreback commented May 1, 2018

I changed the default back to observed=None to preserve options. I think we can discuss wether / if to change this in the future. But for now allows folks to deal with grouping in a better way now.

@jankatins
Copy link
Contributor

I don't see why it is should be good that my code breaks when appending the new data to the old data.

If you have a defined order, where would you put the new categorical into the orders? E.g. good - middle - bad, where would you put 'extreme' -> if you care about order, then you also care about not adding new stuff

I mostly work on the 5-10 most frequent types, and in those cases I will always use observed=False

This usecase might also be satisfied by a collapse or lump method (e.g. see here: http://r4ds.had.co.nz/factors.html#modifying-factor-levels)

@TomAugspurger
Copy link
Contributor

FYI, discussion for a new dict-encoded / interned values type can go at #20899 so we don't lose it.

Reviewing this PR one more time, but I think people are all on board with having the default be observed=False (back compat) for now?

@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our

.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/

.. _whatsnew_0230.enhancements.categorical_grouping:

Categorical Groupers has gained an observed keyword
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has -> have? Because "categorical Groupers" is plural right?

@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort):
categories in the original order.
"""

# we only care about observed values
if observed:
unique_codes = unique1d(self.codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't thought this through, but can this if block be replaced with self.remove_unused_cateogories()._codes_for_groupby(sort=sort, observed=False)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, you actually need the uniques

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?

I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.

Categorical Groupers has gained an observed keyword
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include ubobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change this on purpose, this is more correct.

Copy link
Contributor

@TomAugspurger TomAugspurger May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as

"Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the explanation of Tom is exactly what I meant.

@jreback I have no problem at all with that you don't agree with a comment (it would be strange otherwise :-)) and thus not update for it, but can you then answer to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))

Handling of (un)observed Categorical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't use "grouper" as terminology in our documentation (except for the pd.Grouper object), so I would write "groupby key" or "to group by"

also "multipler" -> "multiple"


When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword
controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those
that are observed groupers (``observed=True``).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"or only those that are observed groupers" -> "or only the observed categories"


pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count()

The returned dtype of the grouped will *always* include *all* of the catergories that were grouped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catergories -> categories


.. ipython:: python

pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe just create s and cat to avoid repeating this a few times

@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
observed : boolean, default None
if True: only show observed values for categorical groupers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capital If (below as well)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you start this explanation with noting this keyword is only when grouping by categorical values?

if True: only show observed values for categorical groupers.
if False: show all values for categorical groupers.
if None: if any categorical groupers, show a FutureWarning,
default to False.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no identation for rst formatting

@@ -2898,14 +2907,16 @@ class Grouping(object):
"""

def __init__(self, index, grouper=None, obj=None, name=None, level=None,
sort=True, in_axis=False):
sort=True, observed=None, in_axis=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have it as False default? (if we want to deprecate in the future, we can then just use None ?)


# TODO(jreback): remove completely
# when observed parameter is defaulted to True
# gh-20583
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this comment be removed for now?

@jorisvandenbossche
Copy link
Member

My remaining comments are mainly doc related, so I am fine with merging this now for 0.23rc, if @jreback does a follow-up PR.

@jreback
Copy link
Contributor Author

jreback commented May 1, 2018

@jorisvandenbossche

I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?

I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.

and I did push new ones.

@jorisvandenbossche
Copy link
Member

and I did push new ones.

Well, if I look at the diff for only the commits you added the last day (https://github.com/pandas-dev/pandas/pull/20583/files/19c9cf7871847de8f0a8504e9f121ad1460512d0..bdf7525812ca670f9406ab8df333030d36d30947), there is no change in the _codes_for_groupby function, while you did update it according to my comments.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 1, 2018

@jreback opened #20902 for the followup.

Will merge in ~1 hour.

@jreback
Copy link
Contributor Author

jreback commented May 1, 2018

@TomAugspurger if you want to do the RC now, then i'll have to followup on comments later.

@TomAugspurger
Copy link
Contributor

If we're doing this for 0.23 then it should go in the RC I think. I can wait a bit longer if you plan to push more changes.

@jreback
Copy link
Contributor Author

jreback commented May 1, 2018

changes will only be cosmetic so u can merge if u want now

@TomAugspurger
Copy link
Contributor

That's what I thought, thanks.

@TomAugspurger TomAugspurger merged commit b020891 into pandas-dev:master May 1, 2018
@dragoljub
Copy link

Great work! Looking forward to trying this out. :+1

@ammar-nizami
Copy link

I think there is a bug with observed=True. The below statement removed the unobserved categorical values, but the dataframe returned wrong counts.
The categorical values were not sorted and the counts were mismatched with the categories.

df.groupby('categorical_column', observed=True, as_index=False).count()

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 30, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment