COMPAT: Emit warning when groupby by a tuple #18731
cc @toobaz
Codecov Report
@@ Coverage Diff @@
## master #18731 +/- ##
==========================================
- Coverage 91.64% 91.62% -0.02%
==========================================
Files 154 154
Lines 51401 51408 +7
==========================================
- Hits 47106 47104 -2
- Misses 4295 4304 +9
pandas/core/groupby.py
Outdated
@@ -2850,7 +2850,15 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
    elif isinstance(key, BaseGrouper):
        return key, [], obj

    # Everything which is not a list is a key (including tuples):
    tuple_as_list = isinstance(key, tuple) and key not in obj
The only disadvantage I see with this approach is that
    pd.DataFrame(1, index=range(3), columns=pd.MultiIndex.from_product([[1, 2], [3,4]])).groupby((7, 8)).mean()
will raise KeyError: 7, while KeyError: (7, 8) would be more correct. Do you think
    isinstance(key, tuple) and key not in obj and set(key).issubset(obj)
is too expensive?
this change would be ok
(Except that, as @jorisvandenbossche reminded below, key could contain non-hashable objects, in which case set(key) would raise, so this possibility should be caught)
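A minimal sketch of the check proposed in this thread, with the hashability guard added. It uses a plain set of column labels in place of obj; the helper names `_is_hashable` and `tuple_means_list` are hypothetical and are not the code in this PR.

```python
def _is_hashable(value):
    """Return True if hash(value) succeeds (hypothetical helper)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


def tuple_means_list(key, columns):
    """Decide whether a tuple `key` should be read as a list of keys.

    `columns` stands in for `obj` and is assumed to be a set of labels.
    Only the all-hashable case is covered here.
    """
    if not isinstance(key, tuple):
        return False
    # set(key) and `key in columns` both raise TypeError on
    # unhashable elements, so check hashability first.
    if not all(_is_hashable(item) for item in key):
        return False
    # A tuple that is itself a label is a single key.
    if key in columns:
        return False
    # Otherwise treat it as a list only if every element is a label.
    return set(key).issubset(columns)


columns = {"a", "b", "c"}
print(tuple_means_list(("a", "b"), columns))
print(tuple_means_list(("a", "z"), columns))
```

With this predicate a tuple such as (7, 8) that matches neither a label nor a set of labels is left as a single key, so a later lookup can report the whole tuple as missing.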
pandas/tests/groupby/test_groupby.py
Outdated
with tm.assert_produces_warning(FutureWarning) as w:
    df[['a', 'b', 'c']].groupby(('a', 'b')).c.mean()

assert "Interpreting tuple 'by' as a list" in str(w[0].message)
Isn't this the same as above?
The example from the docs (see top post #18314 (comment) for the code to reproduce) is still failing with this branch.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -202,6 +202,9 @@ Deprecations
- ``Series.from_array`` and ``SparseSeries.from_array`` are deprecated. Use the normal constructor ``Series(..)`` and ``SparseSeries(..)`` instead (:issue:`18213`).
- ``DataFrame.as_matrix`` is deprecated. Use ``DataFrame.values`` instead (:issue:`18458`).
- ``Series.asobject``, ``DatetimeIndex.asobject``, ``PeriodIndex.asobject`` and ``TimeDeltaIndex.asobject`` have been deprecated. Use ``.astype(object)`` instead (:issue:`18572`)
- Grouping by a tuple of keys now emits a ``FutureWarning`` and is deprecated.
  In the future, a tuple passed to ``'by'`` will always refer to a single key
  that is the actual tuple, instead of treating the tuple as multiple keys (:issue:`18314`)
mention you can simply replace the tuple with a list
pandas/core/groupby.py
Outdated
msg = ("Interpreting tuple 'by' as a list of keys, rather than "
       "a single key. Use 'by={!r}' instead of 'by={!r}'. In the "
       "future, a tuple will always mean a single key.".format(
           list(key), key))
the key can contain a long array or column, so not sure it is a good idea to format it like this into the message.
I thought NumPy's short repr kicked in sooner than it does. I'll fix this.
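One way to address the concern above is to keep the warning text fixed and never format the key into it. The sketch below assumes that approach; the constant `TUPLE_BY_MESSAGE`, the function `listify_tuple_by`, and the `[...]` / `(...)` placeholders are illustrative choices, not necessarily what this PR ended up with.

```python
import warnings

# Fixed text: the key is never interpolated, so a tuple holding a long
# array or column cannot blow up the size of the message.
TUPLE_BY_MESSAGE = (
    "Interpreting tuple 'by' as a list of keys, rather than "
    "a single key. Use 'by=[...]' instead of 'by=(...)'. In the "
    "future, a tuple will always mean a single key."
)


def listify_tuple_by(key):
    """Warn about the deprecated tuple form and return it as a list."""
    warnings.warn(TUPLE_BY_MESSAGE, FutureWarning, stacklevel=2)
    return list(key)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    keys = listify_tuple_by((list(range(10000)), "a"))

print(len(str(caught[0].message)))
```

The message length is the same no matter how large the grouping arrays are, and the substring asserted in the test above ("Interpreting tuple 'by' as a list") is still present.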
Hello @TomAugspurger! Thanks for updating the PR.
Comment last updated on December 18, 2017 at 12:53 Hours UTC
pandas/core/groupby.py
Outdated
all_hashable = is_tuple and all(is_hashable(x) for x in key)

if is_tuple:
    if not all_hashable or key not in obj:
I'm lost. Why do you check that elements are not hashable? I would have done instead
if all_hashable and key not in obj and set(key).issubset(obj):
or (if we want to account for the to-be-deprecated possibility to index with missing keys):
if all_hashable and key not in obj and set(key) & (obj):
Or better - performance-wise:
if key not in obj and all(is_hashable(x) for x in key) and set(key).issubset(obj):
This is for the case where you're grouping by non-hashable arrays like in #18314 (comment)
In that case, don't we know that they're certainly relying on groupby((a, b)) to be groupby([a, b]), so we want to warn and listify?
I do still need to handle your KeyError example.
Ahh, I see what you're doing now. Yes, that's probably better, and will make handling the KeyError easier.
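The two situations discussed in this thread (hashable column names versus non-hashable arrays) can be sketched as one decision. This is a rough illustration under stated assumptions: `columns` is a plain set standing in for obj, and `resolve_by` and `_all_hashable` are hypothetical names rather than the functions in pandas/core/groupby.py.

```python
import warnings


def _all_hashable(items):
    """Return True if every element of `items` can be hashed."""
    try:
        for item in items:
            hash(item)
    except TypeError:
        return False
    return True


def resolve_by(key, columns):
    """Return `key`, converted to a list when a tuple is used as one."""
    if not isinstance(key, tuple):
        return key
    if _all_hashable(key):
        # Column-name case: ('a', 'b') -> ['a', 'b'], but only when the
        # tuple is not itself a label and all of its elements are labels.
        listify = key not in columns and set(key).issubset(columns)
    else:
        # Array case: (a, b) cannot be a label, so it must mean [a, b].
        listify = True
    if listify:
        warnings.warn(
            "Interpreting tuple 'by' as a list of keys, rather than "
            "a single key.",
            FutureWarning,
            stacklevel=2,
        )
        return list(key)
    return key


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(resolve_by(("a", "b"), {"a", "b", "c"}))
    print(resolve_by((7, 8), {1, 2, 3, 4}))
print(len(caught))
```

A tuple such as (7, 8) that matches nothing is returned unchanged and without a warning, which leaves the later lookup free to raise KeyError for the whole tuple.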
Huh, about the example:
    df = pd.DataFrame(1, index=range(3), columns=pd.MultiIndex.from_product([[1, 2], [3,4]]))
    df.groupby((7, 8)).mean()
On master that gives me
    Out[4]:
       1     2
       3  4  3  4
    7  1  1  1  1
    8  1  1  1  1
Is that correct? That seems like it should throw a KeyError, right? Opened #18798 for that.
OK, updated to use your suggestion @toobaz, with a slight modification so that we warn when either
|
Yeah, I think that's perfect (I had forgotten case 2.). Two small comments:
|
Thanks, fixed. Should be good to go hopefully.
lgtm. needs a rebase to fix conflict. merge on green.
Closes #18314