add min_periods, ddof to groupby covariance, & correlation aggregation #9492
Codecov Report
@@              Coverage Diff               @@
##           branch-21.12    #9492    +/-   ##
==============================================
- Coverage         10.79%   10.67%   -0.12%
==============================================
  Files               116      117       +1
  Lines             18869    19714     +845
==============================================
+ Hits               2036     2104      +68
- Misses            16833    17610     +777
Approving the changes in this PR.
However, it's a bit disappointing that the aggregate functor model didn't work well with struct columns, and that we had to add the mean, count, and covariance calculations to `aggregate_result_functor::operator()<aggregation::CORRELATION>` instead of just calling `operator()<aggregation::COVARIANCE>` and then using `cache.get_result()` to get the cov and stddev results out of it.
This is because of the "non-null values only" requirement for both covariance and correlation. I made changes to cache the covariance if it has already been calculated for this struct.
Since Python requests correlation for pairs of columns as structs, caching per child column has the added benefit of sharing std, mean, and count for identical null columns across multiple aggregation requests. If we cache per struct column, this will not be possible.
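To make the caching argument concrete, here is a rough sketch in plain Python. It is not libcudf code: `joint_mask`, `valid_stats`, `cached_stats`, and the cache key are hypothetical names chosen for illustration, and `None` stands in for a null. It assumes the "non-null values only" rule means a column's statistics are taken over the rows where both columns of the pair are valid, so the result for `x` can be shared between two pairs only when the partner columns have the same null pattern.

```python
import math

def joint_mask(xs, ys):
    """True for rows where both columns are non-null (None = null)."""
    return tuple(x is not None and y is not None for x, y in zip(xs, ys))

def valid_stats(values, mask):
    """(count, mean, sample std) of `values` over the rows kept by `mask`."""
    kept = [v for v, keep in zip(values, mask) if keep]
    n = len(kept)
    mean = sum(kept) / n
    var = sum((v - mean) ** 2 for v in kept) / (n - 1)
    return n, mean, math.sqrt(var)

# Hypothetical cache keyed per child column (plus the mask it was computed
# under), rather than per struct (pair of columns).
cache = {}

def cached_stats(name, values, mask):
    key = (name, mask)
    if key not in cache:
        cache[key] = valid_stats(values, mask)
    return cache[key]

x = [1.0, 2.0, 3.0, 10.0]
y = [1.0, None, 3.0, 4.0]
z = [5.0, None, 7.0, 1.0]   # same null pattern as y
w = [1.0, 2.0, None, 4.0]   # different null pattern

stats_xy = cached_stats("x", x, joint_mask(x, y))
stats_xz = cached_stats("x", x, joint_mask(x, z))  # cache hit, reused
stats_xw = cached_stats("x", x, joint_mask(x, w))  # new entry, different rows
```

Here the statistics of `x` are computed once for the pairs `(x, y)` and `(x, z)`, but have to be recomputed for `(x, w)`. A cache keyed on the whole struct would treat all three pairs as unrelated.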
@gpucibot merge
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268.

Related issue: #1268
Related PRs: #9154, #9166, #9492

Next steps:
- [ ] Fix symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
- Mayank Anand (https://github.com/mayankanand007)
- Michael Wang (https://github.com/isVoid)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
Addresses part of #8691
Adds `min_periods` and `ddof` parameters to the libcudf groupby covariance and Pearson correlation aggregations (needed by the Python layer).
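
For readers unfamiliar with the two parameters, the following is a minimal plain-Python sketch of what they are assumed to mean, following the pandas convention: `min_periods` is the minimum number of non-null observation pairs required for a result, and `ddof` is subtracted from the pair count in the divisor. The function names `pairwise_cov` and `groupby_cov` are hypothetical, `None` stands in for a null, and returning `None` for an invalid group is an illustrative choice, not a statement about how libcudf reports it.

```python
from collections import defaultdict

def pairwise_cov(xs, ys, min_periods=1, ddof=1):
    """Covariance over rows where both values are non-null (None = null)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    # Too few valid pairs, or a non-positive divisor: no result for this group.
    if n < min_periods or n - ddof <= 0:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    acc = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return acc / (n - ddof)

def groupby_cov(keys, xs, ys, min_periods=1, ddof=1):
    """Apply pairwise_cov independently to the rows of each key."""
    groups = defaultdict(lambda: ([], []))
    for k, x, y in zip(keys, xs, ys):
        groups[k][0].append(x)
        groups[k][1].append(y)
    return {k: pairwise_cov(gx, gy, min_periods, ddof)
            for k, (gx, gy) in groups.items()}

keys = ["a", "a", "a", "b", "b"]
xs = [1.0, 2.0, None, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, None, 1.0]

default_result = groupby_cov(keys, xs, ys)                    # ddof=1
population_result = groupby_cov(keys, xs, ys, ddof=0)         # divide by n
strict_result = groupby_cov(keys, xs, ys, min_periods=3)      # needs 3 pairs
```

In this data, group `a` has two valid pairs and group `b` has one, so with `ddof=1` group `b` has no result, and with `min_periods=3` neither group does.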