-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix some cases of groupby(...).transform with dropna=True #45953
Changes from 12 commits
b58536c
d9329cd
726a1b8
2f32be1
5e92af4
3919c51
31240e1
5d4e1d3
d3cbf50
c60da74
f4b8023
80b6629
6ecd693
da4201a
e589e11
72a10ac
fe0fbf4
1c015f4
90ba088
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,6 +57,39 @@ Styler | |
|
||
- Fixed bug in :class:`CSSToExcelConverter` leading to ``TypeError`` when border color provided without border style for ``xlsxwriter`` engine (:issue:`42276`) | ||
|
||
.. _whatsnew_150.notable_bug_fixes.groupby_transform_dropna: | ||
|
||
Using ``dropna=True`` with ``groupby`` transformers | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
A transformer is an operation whose result has the same size as its input. When the | ||
result is a :class:`DataFrame` or :class:`Series`, it is also required that the | ||
index of the result is equal to that of the input. In pandas 1.4, using | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "is equal" -> "matches"? I think there was a recent issue with something like df.groupby(foo).transform(DataFrame.sort_index) where there was some question as to whether the index should be retained. Did that get sorted out? (Or is it not relevant here?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It didn't yet get sorted out. In fact, I went to improve the docs because of that issue, and found that I agree matches is better than equal here - and plan to clarify that |
||
:meth:`.DataFrameGroupBy.transform` or :meth:`.SeriesGroupBy.transform` with null | ||
values in the groups and ``dropna=True`` gave incorrect results. Demonstrated by the | ||
examples below, the incorrect results either contained incorrect values, or the result | ||
did not have the same index as the input. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]}) | ||
|
||
*Old behavior*: | ||
|
||
.. code-block:: ipython | ||
|
||
In [3]: df.groupby('a', dropna=True).transform(lambda x: x) | ||
Out[3]: | ||
b | ||
0 2 | ||
1 3 | ||
|
||
*New behavior*: | ||
|
||
.. ipython:: python | ||
|
||
df.groupby('a', dropna=True).transform(lambda x: x) | ||
|
||
.. _whatsnew_150.notable_bug_fixes.notable_bug_fix2: | ||
|
||
notable_bug_fix2 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -783,6 +783,30 @@ def indices(self) -> dict[Hashable, npt.NDArray[np.intp]]: | |
keys = [ping.group_index for ping in self.groupings] | ||
return get_indexer_dict(codes_list, keys) | ||
|
||
@final | ||
def result_ilocs(self) -> npt.NDArray[np.intp]: | ||
""" | ||
Get the original integer locations of result_index in the input. | ||
""" | ||
# Original indices are where group_index would go via sorting. | ||
# But when dropna is true, we need to remove null values whilst accounting for | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "whilst" may be tough for non-native speakers There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks - changed to while |
||
# any gaps that then occur because of them. | ||
group_index = get_group_index(self.codes, self.shape, sort=False, xnull=True) | ||
|
||
if self.has_dropped_na: | ||
mask = np.where(group_index >= 0) | ||
# Count how many gaps are caused by previous null values for each position | ||
null_gaps = np.cumsum(group_index == -1)[mask] | ||
group_index = group_index[mask] | ||
|
||
result = get_group_index_sorter(group_index, self.ngroups) | ||
|
||
if self.has_dropped_na: | ||
# Shift by the number of prior null gaps | ||
result += np.take(null_gaps, result) | ||
|
||
return result | ||
|
||
@final | ||
@property | ||
def codes(self) -> list[np.ndarray]: | ||
|
@@ -825,6 +849,14 @@ def is_monotonic(self) -> bool: | |
# return if my group orderings are monotonic | ||
return Index(self.group_info[0]).is_monotonic_increasing | ||
|
||
@final | ||
@cache_readonly | ||
def has_dropped_na(self) -> bool: | ||
""" | ||
Whether grouper has null value(s) that are dropped. | ||
""" | ||
return bool((self.group_info[0] < 0).any()) | ||
|
||
@cache_readonly | ||
def group_info(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]: | ||
comp_ids, obs_group_ids = self._get_compressed_codes() | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,8 +18,6 @@ | |
import numpy as np | ||
import pytest | ||
|
||
from pandas.compat import pa_version_under2p0 | ||
|
||
import pandas as pd | ||
from pandas.core.arrays import ArrowStringArray | ||
from pandas.core.arrays.string_ import StringDtype | ||
|
@@ -193,10 +191,6 @@ class TestPrinting(base.BasePrintingTests): | |
|
||
class TestGroupBy(base.BaseGroupbyTests): | ||
def test_groupby_extension_transform(self, data_for_grouping, request): | ||
if data_for_grouping.dtype.storage == "pyarrow" and pa_version_under2p0: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. huh surprised that this solves it. neat. |
||
# failure observed in 1.0.1, not in 2.0 or later | ||
mark = pytest.mark.xfail(reason="pyarrow raises in self._data[item]") | ||
request.node.add_marker(mark) | ||
super().test_groupby_extension_transform(data_for_grouping) | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we use "transformer" elsewhere? Could use "transform"/"transforms"? Is there a risk of confusion for ML people for whom "transformer" means something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do mention transformer once in the docs, but agreed with sticking to transform here.