[BUG] Groupby apply on a series returns incorrect values for the key column #8898

Closed
beckernick opened this issue Jul 29, 2021 · 0 comments · Fixed by #9016
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

@beckernick
Member

Groupby.apply on a single column returns incorrect keys in the output index, even though the values themselves are correct and come back in the same order as pandas.

import cudf
import numpy as np

np.random.seed(12)

nrows = 1000000
nkeys = 100

keycol = np.random.choice(range(nkeys), nrows)

df = cudf.DataFrame({
    "key": keycol,
    "a": np.random.randint(0, 1000, nrows),
})
pdf = df.to_pandas()

def mean_minus_deduped_std(x):
    return x.mean() - x.drop_duplicates().std()

print(pdf.groupby("key").a.apply(mean_minus_deduped_std))
print(df.groupby("key").a.apply(mean_minus_deduped_std))

# The values match pandas and come back in the same order; only the index differs
pr = pdf.groupby("key").a.apply(mean_minus_deduped_std)
gr = df.groupby("key").a.apply(mean_minus_deduped_std)

np.testing.assert_array_almost_equal(pr.values, gr.values.get())
key
0     208.686182
1     205.867578
2     212.114642
3     203.059034
4     210.104189
         ...    
95    208.742726
96    210.739197
97    210.747197
98    208.736940
99    211.599720
Name: a, Length: 100, dtype: float64
key
75    208.686182
69    205.867578
97    212.114642
95    203.059034
30    210.104189
         ...    
9     208.742726
51    210.739197
40    210.747197
63    208.736940
45    211.599720
Length: 100, dtype: float64

We'd expect to see every key exactly once in the output index, but some keys appear several times and others are missing (only 64 of the 100 keys show up).

gr.index.to_series().value_counts()
34    4
69    4
32    4
56    3
75    3
     ..
73    1
65    1
9     1
86    1
57    1
Name: key, Length: 64, dtype: int32
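
The "every key once" expectation can also be checked directly. The following is a minimal sketch in plain Python; `index_key_report` is a hypothetical helper name, not part of cuDF or pandas, and the small list stands in for the result index.

```python
from collections import Counter


def index_key_report(result_index, expected_keys):
    """Return (repeated, missing) for a groupby result index.

    repeated: dict mapping each key that appears more than once to its count
    missing:  sorted list of expected keys that never appear
    """
    counts = Counter(result_index)
    repeated = {key: n for key, n in counts.items() if n > 1}
    missing = sorted(set(expected_keys) - set(counts))
    return repeated, missing


# Toy index with one repeated key and one missing key.
repeated, missing = index_key_report([2, 1, 1], expected_keys=range(3))
print(repeated)  # {1: 2}
print(missing)   # [0]
```

With the objects from the reproducer above, a call along the lines of `index_key_report(gr.index.to_pandas(), range(nkeys))` should give the same kind of report, though that call is untested here. A correct result would return an empty dict and an empty list.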
!conda list | grep "cudf\|pandas\|numpy\|arrow"
arrow-cpp                 4.0.1           py38hf0991f3_4_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      21.08.00a210723 cuda_11.2_py38_ga69a8a43b5_324    rapidsai-nightly
cudf_kafka                21.08.00a210723 py38_ga69a8a43b5_324    rapidsai-nightly
dask-cudf                 21.08.00a210723 py38_ga69a8a43b5_324    rapidsai-nightly
geopandas                 0.9.0              pyhd8ed1ab_1    conda-forge
geopandas-base            0.9.0              pyhd8ed1ab_1    conda-forge
libcudf                   21.08.00a210723 cuda11.2_ga69a8a43b5_324    rapidsai-nightly
libcudf_kafka             21.08.00a210723 ga69a8a43b5_324    rapidsai-nightly
numpy                     1.21.1           py38h9894fe3_0    conda-forge
pandas                    1.2.5            py38h1abd341_0    conda-forge
pyarrow                   4.0.1           py38hb53058b_4_cuda    conda-forge
beckernick added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Jul 29, 2021
beckernick added the Python (Affects Python cuDF API.) label and removed the Needs Triage label on Aug 3, 2021
charlesbluca self-assigned this on Aug 11, 2021
rapids-bot pushed a commit that referenced this issue on Aug 13, 2021:
Closes #8898 

Originally, when returning a Series from a `GroupBy.apply()` operation, we would pass in `self.grouping.keys[offsets[:-1]]` as the index, which was meant to grab each unique group key, assuming that `self.grouping.keys` is sorted. However, because it is not sorted, this just ends up grabbing 5 group keys at random.

Since we are already calling `GroupBy._grouped()` in this operation, we can use the `group_names` returned by that as the index instead, which is what the result of `self.grouping.keys[offsets[:-1]]` would be if `self.grouping.keys` were sorted.
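
To make the indexing mistake concrete, here is a small sketch in plain Python. It is not cuDF's actual implementation: the lists `keys`, `offsets`, and `group_names` are stand-ins that only mirror the names used in the description above, and the offsets are assumed to be the group boundaries a sort-based groupby would produce.

```python
from itertools import accumulate

# One (unsorted) group key per row, as in the original key column.
keys = [2, 0, 1, 0, 2, 1, 1]

# What a sort-based groupby would compute: the sorted unique group names,
# the size of each group, and the start offset of each group in sorted order.
group_names = sorted(set(keys))                 # [0, 1, 2]
sizes = [keys.count(k) for k in group_names]    # [2, 3, 2]
offsets = [0, *accumulate(sizes)]               # [0, 2, 5, 7]

# Buggy index: the offsets refer to positions in the *sorted* keys,
# but here they are used to index the *unsorted* key column.
buggy_index = [keys[o] for o in offsets[:-1]]

# The same expression on sorted keys picks the first key of each group.
sorted_keys = sorted(keys)
sorted_index = [sorted_keys[o] for o in offsets[:-1]]

print(buggy_index)   # [2, 1, 1]  -> key 1 repeated, key 0 missing
print(sorted_index)  # [0, 1, 2]  -> identical to group_names
```

The buggy index shows the same symptom as the report (repeated and missing keys), while the sorted variant matches `group_names`, which is why using `group_names` directly is an equivalent and simpler fix.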

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Michael Wang (https://github.com/isVoid)

URL: #9016