[FEA] Implement .describe() for DataFrameGroupBy #8179

Merged
merged 14 commits into from
Jun 7, 2021

@skirui-source skirui-source commented May 7, 2021

This PR implements the .describe() method for DataFrame.groupby() results, which generates
per-group summary statistics (count, mean, std, min, quartiles, max), similar to pandas.

>>> import pandas as pd
>>> pdf = pd.DataFrame({"Speed": [380.0, 370.0, 24.0, 26.0], "Score": [50, 30, 90, 80]})
>>> pdf
   Speed  Score
0  380.0     50
1  370.0     30
2   24.0     90
3   26.0     80
>>> pdf.groupby('Score').describe()
      Speed
      count   mean std    min    25%    50%    75%    max
Score                                                    
30      1.0  370.0 NaN  370.0  370.0  370.0  370.0  370.0
50      1.0  380.0 NaN  380.0  380.0  380.0  380.0  380.0
80      1.0   26.0 NaN   26.0   26.0   26.0   26.0   26.0
90      1.0   24.0 NaN   24.0   24.0   24.0   24.0   24.0


>>> import cudf
>>> gdf = cudf.from_pandas(pdf)
>>> gdf.groupby('Score').describe()
       count   mean   std    min    25%    50%    75%    max
Score                                                       
30         1  370.0  <NA>  370.0  370.0  370.0  370.0  370.0
50         1  380.0  <NA>  380.0  380.0  380.0  380.0  380.0
80         1   26.0  <NA>   26.0   26.0   26.0   26.0   26.0
90         1   24.0  <NA>   24.0   24.0   24.0   24.0   24.0

Fixes: #7990
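
For readers unfamiliar with what the per-group summary contains, here is a minimal sketch that assembles the same eight statistics from individual groupby aggregations. It uses pandas only, so it runs without a GPU. The helper name `describe_groups` is hypothetical, and this is an illustration of the output's structure, not the implementation in this PR.

```python
import pandas as pd

pdf = pd.DataFrame(
    {"Speed": [380.0, 370.0, 24.0, 26.0], "Score": [50, 30, 90, 80]}
)


def describe_groups(df, by, column):
    """Hypothetical helper: build describe()-style statistics for one column."""
    grouped = df.groupby(by)[column]
    # Each aggregation returns a Series indexed by the group keys;
    # the dict order fixes the column order of the result.
    stats = {
        "count": grouped.count(),
        "mean": grouped.mean(),
        "std": grouped.std(),  # sample std (ddof=1): NaN for one-row groups
        "min": grouped.min(),
        "25%": grouped.quantile(0.25),
        "50%": grouped.quantile(0.50),
        "75%": grouped.quantile(0.75),
        "max": grouped.max(),
    }
    return pd.DataFrame(stats)


result = describe_groups(pdf, "Score", "Speed")
print(result)
```

Because every group in the example above holds a single row, the sample standard deviation is undefined, which is why the std column is NaN in pandas and <NA> in cuDF.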

@skirui-source skirui-source added feature request New feature or request Python Affects Python cuDF API. labels May 7, 2021
@skirui-source skirui-source self-assigned this May 7, 2021

codecov bot commented May 7, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@7231e3b).
The diff coverage is n/a.

❗ Current head 9eb815b differs from pull request most recent head 571ad90. Consider uploading reports for the commit 571ad90 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8179   +/-   ##
===============================================
  Coverage                ?   82.84%           
===============================================
  Files                   ?      109           
  Lines                   ?    17913           
  Branches                ?        0           
===============================================
  Hits                    ?    14840           
  Misses                  ?     3073           
  Partials                ?        0           


@skirui-source skirui-source marked this pull request as ready for review May 12, 2021 15:39
@skirui-source skirui-source requested a review from a team as a code owner May 12, 2021 15:39
@skirui-source skirui-source added the non-breaking Non-breaking change label May 12, 2021

@galipremsagar galipremsagar left a comment


Can we add some tests for this?


@cwharris cwharris left a comment


I have concerns regarding the use (and implementation) of assert_groupby_results_equal.

Review threads:
python/cudf/cudf/tests/test_groupby.py (resolved)
python/cudf/cudf/core/groupby/groupby.py (outdated, resolved)
python/cudf/cudf/core/groupby/groupby.py (outdated, resolved)
python/cudf/cudf/tests/test_groupby.py (outdated, resolved)
python/cudf/cudf/core/groupby/groupby.py (outdated, resolved)
python/cudf/cudf/core/groupby/groupby.py (outdated, resolved)

@shwina shwina left a comment


LGTM!

@shwina shwina changed the base branch from branch-21.06 to branch-21.08 June 1, 2021 21:11

shwina commented Jun 1, 2021

@gpucibot merge

@skirui-source

re-run tests

@rapids-bot rapids-bot bot merged commit 92ed5b3 into rapidsai:branch-21.08 Jun 7, 2021
@skirui-source skirui-source deleted the describe branch October 19, 2021 20:07
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[FEA] .describe() after DataFrameGroupBy
5 participants