add min_periods, ddof to groupby covariance, & correlation aggregation #9492

karthikeyann · 2021-10-21T20:02:16Z

Addresses part of #8691
Add min_periods and ddof parameters to libcudf groupby covariance and Pearson correlation (python needs this)

codecov · 2021-10-21T21:50:29Z

Codecov Report

Merging #9492 (d6ebfb3) into branch-21.12 (ab4bfaa) will decrease coverage by 0.11%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-21.12    #9492      +/-   ##
================================================
- Coverage         10.79%   10.67%   -0.12%     
================================================
  Files               116      117       +1     
  Lines             18869    19714     +845     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17610     +777

Impacted Files	Coverage Δ
python/dask_cudf/dask_cudf/sorting.py	`92.90% <0.00%> (-1.21%)`	⬇️
python/cudf/cudf/io/csv.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/hdf.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/orc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/abc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/api/types.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/dlpack.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23de7d0...d6ebfb3. Read the comment docs.

devavret

Approving the changes in this PR.

However it's a bit disappointing that the aggregate functor model didn't work well with struct columns. And that we had to add mean, count, and covariance calculation to aggregate_result_functor::operator()<aggregation::CORRELATION> instead of just calling operator()<aggregation::COVARIANCE> and then use cache.get_result() to get the cov, stddev results out of it.

karthikeyann · 2021-10-22T17:51:49Z

we had to add mean, count, and covariance calculation to aggregate_result_functor::operator()aggregation::CORRELATION instead of just calling operator()aggregation::COVARIANCE

This is because of the "non-null values only" requirement for both covariance and correlation. If operator()<aggregation::COVARIANCE> is called, child columns mean & count will be recalculated for child columns with non-identical nulls.
Anyway it shares the same cache. So for identical nulls, it will still be able to use the cache and avoid the repeated computation.

I made changes to cache covariance if already calculated for this struct.

However it's a bit disappointing that the aggregate functor model didn't work well with struct columns

Since python requests correlation for pairs of columns as structs, caching for child column has added benefit of sharing std, mean, count for identical null columns across multiple aggregation requests. if we cache as struct column, this will not be possible.

karthikeyann · 2021-10-26T17:24:13Z

@gpucibot merge

This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268 Related issue: #1268 Related PRs: #9154, #9166, #9492 Next steps: - [ ] Fix Symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/ correlation between the same colums twice - [ ] Consolidate both `cov()` and `corr()` - [ ] Fix #10303 - [ ] Add `cov `bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment)) - [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment)) Authors: - Mayank Anand (https://github.com/mayankanand007) - Michael Wang (https://github.com/isVoid) - Sheilah Kirui (https://github.com/skirui-source) Approvers: - Bradley Dice (https://github.com/bdice) - Michael Wang (https://github.com/isVoid) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9889

karthikeyann added 2 commits October 22, 2021 01:28

add min_periods, ddof to covariance, & correlation

40d1ecb

add min_periods, ddof tests to covariance, correlation

b157316

karthikeyann added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Oct 21, 2021

karthikeyann requested a review from a team as a code owner October 21, 2021 20:02

karthikeyann requested review from devavret and jrhemstad October 21, 2021 20:02

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 21, 2021

style fix

9a087cf

devavret approved these changes Oct 22, 2021

View reviewed changes

skip covariance computation if already present in cache

d6ebfb3

karthikeyann requested a review from a team October 26, 2021 03:58

jrhemstad approved these changes Oct 26, 2021

View reviewed changes

karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 26, 2021

rapids-bot bot merged commit c0951ba into rapidsai:branch-21.12 Oct 26, 2021

mayankanand007 mentioned this pull request Dec 10, 2021

Add covariance for sort groupby (python) #9889

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add min_periods, ddof to groupby covariance, & correlation aggregation #9492

add min_periods, ddof to groupby covariance, & correlation aggregation #9492

karthikeyann commented Oct 21, 2021

codecov bot commented Oct 21, 2021 •

edited

Loading

devavret left a comment

karthikeyann commented Oct 22, 2021

karthikeyann commented Oct 26, 2021

add min_periods, ddof to groupby covariance, & correlation aggregation #9492

add min_periods, ddof to groupby covariance, & correlation aggregation #9492

Conversation

karthikeyann commented Oct 21, 2021

codecov bot commented Oct 21, 2021 • edited Loading

Codecov Report

devavret left a comment

Choose a reason for hiding this comment

karthikeyann commented Oct 22, 2021

karthikeyann commented Oct 26, 2021

codecov bot commented Oct 21, 2021 •

edited

Loading