BUG: grouping by tuple of values gives TypeError: unhashable type #18314

jorisvandenbossche · 2017-11-15T21:44:30Z

This example in the docs started failing somewhere since the last commits:

In [1]: business_dates = date_range(start='4/1/2014', end='6/30/2014', freq='B')

In [2]: df = DataFrame(1, index=business_dates, columns=['a', 'b'])

In [3]: df.groupby((df.index.year, df.index.month)).nth([0, 3, -1])
...
TypeError: unhashable type: 'Int64Index'

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2017-11-15T21:47:54Z

Ah, this was deliberately changed by #18249 cc @toobaz @jreback

It seems it also affected dask cc @TomAugspurger

jorisvandenbossche · 2017-11-15T21:50:21Z

If we decide to stick with what #18249 did (not special case Multi-Index, always see tuple as label), then we could consider deprecating this first?

toobaz · 2017-11-15T23:21:54Z

If we decide to stick with what #18249 did (not special case Multi-Index, always see tuple as label)

In general, every time a collection of keys is provided as argument to a pandas method, it should be in a list, not in a tuple, because otherwise we have ambiguity (and not just with MultiIndex). And maybe this should be documented more clearly in the docs (I see "list of labels or tuples" here, but only talking about indexing). But indeed, the groubpy docstring (before my intervention) mentioned a generic "iterable" but then specified "list of strs" (which I changed to "list of labels").

But then, in this specific case, the tuple does not contain keys, but rather full series, and in principle we might want to allow this (although at a cost, because we would need to distinguish valid - e.g. hashable - keys from ndarray-like objects). Indeed my PR had not considered the possibility to pass more than one series, which is not mentioned in the docstring, but which we certainly don't want to drop. The question is whether we want to allow passing them in a tuple.

TomAugspurger · 2017-11-16T01:21:22Z

Yes, this did affect dask, though their use case was a bit strange since the user-provided ddf.groupby(['a', 'b']) when through a function that takes *args, so the list became a tuple.

How hard would it be to make df.groupby(('a', 'b')) emit a future warning for a release, and then making it always be a key in the next?

toobaz · 2017-11-16T08:20:17Z

How hard would it be to make df.groupby(('a', 'b')) emit a future warning for a release

Not hard... the warning would affect also grouping by a legitimate tuple label (contained in a flat index), but this was not supported until my change, so it shouldn't annoy too many people. Shall I proceed?

TomAugspurger · 2017-11-16T12:25:28Z

Not hard... the warning would affect also grouping by a legitimate tuple label

What are your thoughts on these cases?

df = pd.DataFrame({('a', 'b'): [1, 1, 2, 2], 'a': [1, 1, 1, 2], 'b': [1, 2, 2, 2], 'c': [1, 1, 1, 1]})

# Case 1, no ambiguity.  No warning, use the tuple?
df[[('a', 'b'), 'a', 'c']].groupby(('a', 'b')).c.mean()

# Case 2, no ambiguity. Warn and listify the tuple?
df[['a', 'b', 'c']].groupby(('a', 'b')).c.mean()

# Case 3, ambiguous. Probably use the tuple?
df.groupby(('a', 'b')).c.mean()

IIUC, 2 is the case that we "broke", and would like to warn for. I don't know what happened with case 3 before, but I think the correct thing to do is just use the tuple, probably without a warning?

toobaz · 2017-11-16T15:02:13Z

What are your thoughts on these cases?

My plan was actually to not check the content of the columns. So all 3 cases would listify and emit a warning (and case 1 would actually break as before my PR).

This said, while I think in general trying to find labels in indexes and then catching KeyErrors as fallback (and listifying, in this case) is a strategy we should avoid, for a temporary fix it could make sense. In this case, things would go as you suggested above, and case 3 would use the tuple (which is good, since it is the behavior we finally want to keep). Then after one release cycle, we would stop any "listification".

Closes pandas-dev#18314

* COMPAT: Emit warning when groupby by a tuple Closes #18314 * DOC: avoid future warning * Cleanup, test unhashable * PEP8 * Correct KeyError * update * xfail * remove old comments * pep8 * Fixups

jorisvandenbossche added this to the 0.22.0 milestone Nov 15, 2017

jorisvandenbossche added Docs Groupby labels Nov 15, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 11, 2017

COMPAT: Emit warning when groupby by a tuple

209ffce

Closes pandas-dev#18314

TomAugspurger mentioned this issue Dec 11, 2017

COMPAT: Emit warning when groupby by a tuple #18731

Merged

TomAugspurger closed this as completed in #18731 Dec 18, 2017

WillAyd mentioned this issue Nov 20, 2019

Remove Ambiguous Behavior of Tuple as Grouping #29755

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: grouping by tuple of values gives TypeError: unhashable type #18314

BUG: grouping by tuple of values gives TypeError: unhashable type #18314

jorisvandenbossche commented Nov 15, 2017

jorisvandenbossche commented Nov 15, 2017

jorisvandenbossche commented Nov 15, 2017

toobaz commented Nov 15, 2017 •

edited

Loading

TomAugspurger commented Nov 16, 2017

toobaz commented Nov 16, 2017

TomAugspurger commented Nov 16, 2017

toobaz commented Nov 16, 2017

BUG: grouping by tuple of values gives TypeError: unhashable type #18314

BUG: grouping by tuple of values gives TypeError: unhashable type #18314

Comments

jorisvandenbossche commented Nov 15, 2017

jorisvandenbossche commented Nov 15, 2017

jorisvandenbossche commented Nov 15, 2017

toobaz commented Nov 15, 2017 • edited Loading

TomAugspurger commented Nov 16, 2017

toobaz commented Nov 16, 2017

TomAugspurger commented Nov 16, 2017

toobaz commented Nov 16, 2017

toobaz commented Nov 15, 2017 •

edited

Loading