[BUG] Groupby apply on a series returns incorrect values for the key column #8898

beckernick · 2021-07-29T15:46:39Z

`GroupBy.apply` on a single column returns incorrect keys in the output index, even though the values themselves are correct and are returned in the expected order.

import cudf
import numpy as np

np.random.seed(12)

nrows = 1000000
nkeys = 100

keycol = np.random.choice(range(nkeys), nrows)

df = cudf.DataFrame({
    "key": keycol,
    "a": np.random.randint(0, 1000, nrows),
})
pdf = df.to_pandas()

def mean_minus_deduped_std(x):
    return x.mean() - x.drop_duplicates().std()

print(pdf.groupby("key").a.apply(mean_minus_deduped_std))
print(df.groupby("key").a.apply(mean_minus_deduped_std))


# Consistent output order / results the same
pr = pdf.groupby("key").a.apply(mean_minus_deduped_std)
gr = df.groupby("key").a.apply(mean_minus_deduped_std)

np.testing.assert_array_almost_equal(pr.values, gr.values.get())
key
0     208.686182
1     205.867578
2     212.114642
3     203.059034
4     210.104189
         ...    
95    208.742726
96    210.739197
97    210.747197
98    208.736940
99    211.599720
Name: a, Length: 100, dtype: float64
key
75    208.686182
69    205.867578
97    212.114642
95    203.059034
30    210.104189
         ...    
9     208.742726
51    210.739197
40    210.747197
63    208.736940
45    211.599720
Length: 100, dtype: float64

We'd expect every key to appear exactly once in the output index, but it does not: some keys are repeated and only 64 of the 100 keys appear at all.

gr.index.to_series().value_counts()
34    4
69    4
32    4
56    3
75    3
     ..
73    1
65    1
9     1
86    1
57    1
Name: key, Length: 64, dtype: int32
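The check above can also be written as a small host-side helper that reports which keys are duplicated and which are missing. This is only a sketch: `check_group_index` is a hypothetical name, and the toy labels stand in for the values of `gr.index` after they have been copied to the host.

```python
from collections import Counter


def check_group_index(index_labels, expected_keys):
    """Return (duplicated, missing) labels for a groupby result index.

    Hypothetical helper for illustration; not part of cuDF or pandas.
    """
    counts = Counter(index_labels)
    # Labels that occur more than once in the result index.
    duplicated = sorted(k for k, n in counts.items() if n > 1)
    # Keys that should have produced a group but never appear.
    missing = sorted(set(expected_keys) - set(counts))
    return duplicated, missing


# Toy labels standing in for the buggy index: 69 is repeated, 95 is absent.
duplicated, missing = check_group_index([75, 69, 97, 69, 30], [30, 69, 75, 95, 97])
print(duplicated, missing)
```

A correct result would return two empty lists.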

!conda list | grep "cudf\|pandas\|numpy\|arrow"
arrow-cpp                 4.0.1           py38hf0991f3_4_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      21.08.00a210723 cuda_11.2_py38_ga69a8a43b5_324    rapidsai-nightly
cudf_kafka                21.08.00a210723 py38_ga69a8a43b5_324    rapidsai-nightly
dask-cudf                 21.08.00a210723 py38_ga69a8a43b5_324    rapidsai-nightly
geopandas                 0.9.0              pyhd8ed1ab_1    conda-forge
geopandas-base            0.9.0              pyhd8ed1ab_1    conda-forge
libcudf                   21.08.00a210723 cuda11.2_ga69a8a43b5_324    rapidsai-nightly
libcudf_kafka             21.08.00a210723 ga69a8a43b5_324    rapidsai-nightly
numpy                     1.21.1           py38h9894fe3_0    conda-forge
pandas                    1.2.5            py38h1abd341_0    conda-forge
pyarrow                   4.0.1           py38hb53058b_4_cuda    conda-forge


Closes #8898

Originally, when returning a Series from a `GroupBy.apply()` operation, we would pass in `self.grouping.keys[offsets[:-1]]` as the index, which was meant to grab each unique group key, assuming that `self.grouping.keys` is sorted. However, because it is not sorted, this just ends up grabbing 5 group keys at random. Since we are already calling `GroupBy._grouped()` in this operation, we can use the `group_names` returned by that as the index instead, which is what the result of `self.grouping.keys[offsets[:-1]]` would be if `self.grouping.keys` was sorted.

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- Michael Wang (https://github.com/isVoid)

URL: #9016
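The indexing problem described in the fix can be reproduced on the host with NumPy alone. The sketch below does not use cuDF internals: the toy `keys` array and the way `offsets` is built from group sizes are assumptions made for illustration, chosen to mimic group offsets applied to an unsorted key column.

```python
import numpy as np

# Hypothetical toy data: one key per row, deliberately left unsorted.
keys = np.array([2, 0, 1, 2, 2, 1, 2, 0])

# Unique group names and group sizes; the offsets mark where each group
# would start if the rows were laid out group by group.
group_names, counts = np.unique(keys, return_counts=True)
offsets = np.concatenate(([0], np.cumsum(counts)))

# Indexing the unsorted keys with the group offsets picks arbitrary rows,
# so a key can be repeated while another never appears.
wrong_index = keys[offsets[:-1]]

# The same expression is only correct once the keys are sorted.
right_index = np.sort(keys)[offsets[:-1]]

print(wrong_index)   # index built from unsorted keys
print(right_index)   # index built from sorted keys
print(group_names)   # unique group names, matching the sorted case
```

With this data the unsorted lookup repeats key 2 and drops key 0, while the sorted lookup matches the unique group names, which is the same relationship the fix relies on when it switches to `group_names`.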

beckernick added the bug and Needs Triage labels Jul 29, 2021

beckernick mentioned this issue Jul 29, 2021

[BUG] Groupby apply on a series does not retain series name #8899

Closed

beckernick added this to the Pandas API Alignment and Coverage milestone Jul 29, 2021

beckernick added the Python label and removed the Needs Triage label Aug 3, 2021

charlesbluca self-assigned this Aug 11, 2021

charlesbluca mentioned this issue Aug 11, 2021

Use correct index when returning Series from GroupBy.apply() #9016

Merged

rapids-bot bot closed this as completed in #9016 Aug 13, 2021
