Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add min_periods, ddof to groupby covariance, & correlation aggregation #9492

Merged
merged 4 commits into from
Oct 26, 2021

Conversation

karthikeyann
Copy link
Contributor

Addresses part of #8691
Add min_periods and ddof parameters to libcudf groupby covariance and Pearson correlation (python needs this)

@karthikeyann karthikeyann added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Oct 21, 2021
@karthikeyann karthikeyann requested a review from a team as a code owner October 21, 2021 20:02
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 21, 2021
@codecov
Copy link

codecov bot commented Oct 21, 2021

Codecov Report

Merging #9492 (d6ebfb3) into branch-21.12 (ab4bfaa) will decrease coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9492      +/-   ##
================================================
- Coverage         10.79%   10.67%   -0.12%     
================================================
  Files               116      117       +1     
  Lines             18869    19714     +845     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17610     +777     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23de7d0...d6ebfb3. Read the comment docs.

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the changes in this PR.

However it's a bit disappointing that the aggregate functor model didn't work well with struct columns. And that we had to add mean, count, and covariance calculation to aggregate_result_functor::operator()<aggregation::CORRELATION> instead of just calling operator()<aggregation::COVARIANCE> and then use cache.get_result() to get the cov, stddev results out of it.

@karthikeyann
Copy link
Contributor Author

we had to add mean, count, and covariance calculation to aggregate_result_functor::operator()aggregation::CORRELATION instead of just calling operator()aggregation::COVARIANCE

This is because of the "non-null values only" requirement for both covariance and correlation. If operator()<aggregation::COVARIANCE> is called, child columns mean & count will be recalculated for child columns with non-identical nulls.
Anyway it shares the same cache. So for identical nulls, it will still be able to use the cache and avoid the repeated computation.

I made changes to cache covariance if already calculated for this struct.

However it's a bit disappointing that the aggregate functor model didn't work well with struct columns

Since python requests correlation for pairs of columns as structs, caching for child column has added benefit of sharing std, mean, count for identical null columns across multiple aggregation requests. if we cache as struct column, this will not be possible.

@karthikeyann karthikeyann requested a review from a team October 26, 2021 03:58
@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 26, 2021
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c0951ba into rapidsai:branch-21.12 Oct 26, 2021
rapids-bot bot pushed a commit that referenced this pull request Feb 17, 2022
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268

Related issue: #1268
Related PRs: #9154, #9166, #9492 

Next steps:

- [ ] Fix Symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/ correlation between the same colums twice
- [ ] 	Consolidate  both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov `bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
  - Mayank Anand (https://github.com/mayankanand007)
  - Michael Wang (https://github.com/isVoid)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants