BUG: grouping by tuple of values gives TypeError: unhashable type #18314
Comments
Ah, this was deliberately changed by #18249. cc @toobaz @jreback. It seems it also affected dask. cc @TomAugspurger
If we decide to stick with what #18249 did (no special-casing of MultiIndex, always treating a tuple as a label), then we could consider deprecating this first?
In general, every time a collection of keys is provided as argument to a pandas method, it should be in a list, not in a tuple, because otherwise we have ambiguity (and not just with But then, in this specific case, the tuple does not contain keys, but rather full series, and in principle we might want to allow this (although at a cost, because we would need to distinguish valid - e.g. hashable - keys from ndarray-like objects). Indeed my PR had not considered the possibility to pass more than one series, which is not mentioned in the docstring, but which we certainly don't want to drop. The question is whether we want to allow passing them in a tuple.
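The distinction drawn above between hashable keys and ndarray-like objects can be illustrated with a minimal sketch. The helper names `is_hashable` and `looks_like_label` are hypothetical and used only for illustration (they are not claimed to be what pandas does internally), and plain lists stand in for array-like objects, which are likewise unhashable.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


def looks_like_label(key):
    """Hypothetical check: a key can only be a label if it is hashable.

    A tuple is hashable only when every element is hashable, so a tuple
    of scalars passes while a tuple of lists (array stand-ins) does not.
    """
    return is_hashable(key)


label_key = ('a', 'b')              # could be a single tuple label
array_key = ([1, 1, 2], [1, 2, 2])  # tuple of array-like groupers

print(looks_like_label(label_key))  # True
print(looks_like_label(array_key))  # False
```

Hashing the second tuple is what produces the `TypeError: unhashable type` named in the issue title, since a tuple hashes its elements.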
Yes, this did affect dask, though their use case was a bit strange since the user-provided How hard would it be to make
Not hard... the warning would also affect grouping by a legitimate tuple label (contained in a flat index), but this was not supported until my change, so it shouldn't annoy too many people. Shall I proceed?
What are your thoughts on these cases?

import pandas as pd

df = pd.DataFrame({('a', 'b'): [1, 1, 2, 2], 'a': [1, 1, 1, 2], 'b': [1, 2, 2, 2], 'c': [1, 1, 1, 1]})
# Case 1, no ambiguity. No warning, use the tuple?
df[[('a', 'b'), 'a', 'c']].groupby(('a', 'b')).c.mean()
# Case 2, no ambiguity. Warn and listify the tuple?
df[['a', 'b', 'c']].groupby(('a', 'b')).c.mean()
# Case 3, ambiguous. Probably use the tuple?
df.groupby(('a', 'b')).c.mean()

IIUC, 2 is the case that we "broke", and would like to warn for. I don't know what happened with case 3 before, but I think the correct thing to do is just use the tuple, probably without a warning?
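The three cases above amount to a column-aware resolution rule. Below is a minimal pure-Python sketch of that rule, assuming the columns are available as a plain list of labels. The function `resolve_tuple_key` and its return convention are hypothetical, made up for illustration, and are not taken from the pandas source.

```python
import warnings


def resolve_tuple_key(key, columns):
    """Hypothetical helper: decide how a tuple passed as `by` is treated.

    Returns ("label", key) when the tuple is used as a single label, or
    ("keys", list(key)) when it is reinterpreted as a list of keys.
    """
    if not isinstance(key, tuple):
        return ("label", key)
    if key in columns:
        # Cases 1 and 3: the tuple itself is a column label, so use it.
        return ("label", key)
    if all(k in columns for k in key):
        # Case 2: only the elements are columns, so warn and listify.
        warnings.warn(
            "Interpreting a tuple 'by' as a list of keys; "
            "pass a list instead.",
            FutureWarning,
            stacklevel=2,
        )
        return ("keys", list(key))
    raise KeyError(key)


case1 = resolve_tuple_key(('a', 'b'), [('a', 'b'), 'a', 'c'])
case3 = resolve_tuple_key(('a', 'b'), [('a', 'b'), 'a', 'b', 'c'])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    case2 = resolve_tuple_key(('a', 'b'), ['a', 'b', 'c'])

print(case1, case2, case3, len(caught))
```

The cost of this design is the extra membership checks against the columns on every call, which is the trade-off the thread goes on to discuss.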
My plan was actually to not check the content of the columns. So all 3 cases would listify and emit a warning (and case 1 would actually break as before my PR). That said, while I think in general trying to find labels in indexes and then catching
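For comparison, the column-agnostic plan described in this comment can be sketched as follows. The helper `listify_tuple_key` is a hypothetical name used only for illustration. It never inspects the columns, which is why case 1 would break: the listified keys include `'b'`, which is not a column there.

```python
import warnings


def listify_tuple_key(key):
    """Hypothetical helper: always turn a tuple `by` into a list, with a warning."""
    if isinstance(key, tuple):
        warnings.warn(
            "Interpreting a tuple 'by' as a list of keys; "
            "pass a list instead.",
            FutureWarning,
            stacklevel=2,
        )
        return list(key)
    return key


# Columns from case 1: the tuple is a real label, but 'b' alone is absent.
columns_case1 = [('a', 'b'), 'a', 'c']

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    keys = listify_tuple_key(('a', 'b'))

# Keys that a subsequent column lookup would fail on.
missing = [k for k in keys if k not in columns_case1]
print(keys, missing, len(caught))
```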
* COMPAT: Emit warning when groupby by a tuple
  Closes #18314
* DOC: avoid future warning
* Cleanup, test unhashable
* PEP8
* Correct KeyError
* update
* xfail
* remove old comments
* pep8
* Fixups
This example in the docs started failing somewhere since the last commits: