[FEA] libcuDF covariance for Series and Groupby #1268

Closed
beckernick opened this issue Mar 22, 2019 · 2 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@beckernick
Member

beckernick commented Mar 22, 2019

Is your feature request related to a problem? Please describe.
As a cuDF user, I want to calculate the covariance matrix of my dataset. While I am interested in the covariance between two series, many downstream uses need the entire covariance matrix rather than just the covariance of two columns. Implementing covariance in libcudf at the gdf_column level will let us build the DataFrame-level API in the Python bindings.

The pandas API docs are here.

Describe the solution you'd like
I'd like to be able to call df.cov() and get a pairwise covariance matrix of my dataset. To do this, we'll need a method that can compute the covariance of two gdf columns.

Describe alternatives you've considered
The alternative is to calculate the covariance manually, most easily with a double for loop over every pair of columns.
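For illustration, the manual double-for-loop approach might look like the sketch below. It uses plain Python lists rather than cuDF columns, and the helper names `covariance` and `covariance_matrix` are hypothetical, not part of cuDF or libcudf.

```python
def covariance(x, y, ddof=1):
    """Covariance of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n - ddof <= 0:
        raise ValueError("need more than ddof observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - ddof)


def covariance_matrix(columns, ddof=1):
    """Pairwise covariance of a dict mapping column name -> list of values."""
    names = list(columns)
    matrix = {}
    # The double for loop: one covariance call per ordered pair of columns.
    for row in names:
        matrix[row] = {}
        for col in names:
            matrix[row][col] = covariance(columns[row], columns[col], ddof)
    return matrix


data = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]}
cov = covariance_matrix(data)
```

With n columns this makes n * n covariance calls, which is part of why a column-level primitive in libcudf is attractive.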

Additional context

@beckernick beckernick added Needs Triage Need team to review and classify feature request New feature or request labels Mar 22, 2019
@beckernick beckernick changed the title [FEA] DataFrame level covariance [FEA] Series level covariance Mar 22, 2019
@kkraus14 kkraus14 added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 27, 2019
@beckernick beckernick changed the title [FEA] Series level covariance [FEA] libcuDF covariance for Series and Groupby Oct 3, 2019
@beckernick
Member Author

beckernick commented Oct 3, 2019

#2719 provided a temporary, Python-based implementation for Series.covariance. Updating this issue to refer to a libcuDF implementation.

rapids-bot bot pushed a commit that referenced this issue Oct 18, 2021
Add sort-groupby covariance and Pearson correlation in libcudf 
Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as the two child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`).

```
covariance = Sum((x-mean_x)*(y-mean_y)) / (group_size-ddof)
Pearson correlation = covariance / xstddev / ystddev
```

Both x and y values should be non-null.
mean, stddev, and count should be calculated only on the rows where both columns are non-null.
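As a rough illustration of the formulas and the null handling described above, here is a plain-Python sketch, not the libcudf implementation. `None` stands in for a null, the function name `group_cov_corr` is hypothetical, and using the same `ddof` for the standard deviations as for the covariance is an assumption.

```python
import math


def group_cov_corr(keys, x, y, ddof=1):
    """Per-group (covariance, Pearson correlation) of x and y.

    A row contributes only if both x and y are non-null (not None).
    Groups with no such rows do not appear in the result.
    """
    groups = {}
    for k, a, b in zip(keys, x, y):
        if a is not None and b is not None:
            groups.setdefault(k, []).append((a, b))

    result = {}
    for k, pairs in groups.items():
        n = len(pairs)  # group_size counted over common non-null rows
        if n - ddof <= 0:
            result[k] = (None, None)
            continue
        mean_x = sum(a for a, _ in pairs) / n
        mean_y = sum(b for _, b in pairs) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - ddof)
        # Assumption: stddev uses the same ddof as the covariance.
        std_x = math.sqrt(sum((a - mean_x) ** 2 for a, _ in pairs) / (n - ddof))
        std_y = math.sqrt(sum((b - mean_y) ** 2 for _, b in pairs) / (n - ddof))
        corr = cov / std_x / std_y if std_x > 0 and std_y > 0 else None
        result[k] = (cov, corr)
    return result


keys = ["g1", "g1", "g1", "g1", "g2", "g2", "g2"]
x = [1.0, 2.0, None, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 5.0, 6.0, 3.0, 2.0, 1.0]
stats = group_cov_corr(keys, x, y)
```

In this example the third row of `g1` is dropped because `x` is null there, so the mean, stddev, and count for `g1` are taken over the remaining three rows.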

The mean, stddev, and count of the child columns are cached.
One limitation: when the two nullable columns have non-identical null masks, the cached results (mean, stddev, count) for the common valid rows cannot be reused, because the null mask produced by bitmask_and goes out of scope and a new null mask is created for the next set of columns (even if they are the same).

Unit tests for covariance and Pearson correlation added.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec

URL: #9154
rapids-bot bot pushed a commit that referenced this issue Feb 17, 2022
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268

Related issue: #1268
Related PRs: #9154, #9166, #9492 

Next steps:

- [ ] Fix symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))
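
The symmetry item in the list above can be sketched in plain Python as follows. This is only an illustration of the idea (compute each unordered column pair once and mirror the value), not the cuDF code; the names `_cov` and `symmetric_covariance_matrix` are hypothetical.

```python
from itertools import combinations_with_replacement


def _cov(x, y, ddof=1):
    """Covariance of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - ddof)


def symmetric_covariance_matrix(columns, ddof=1):
    """Return (matrix, number of covariance calls) for a dict of columns."""
    names = list(columns)
    matrix = {name: {} for name in names}
    calls = 0
    # Visit each unordered pair (including the diagonal) exactly once.
    for a, b in combinations_with_replacement(names, 2):
        value = _cov(columns[a], columns[b], ddof)
        calls += 1
        matrix[a][b] = value
        matrix[b][a] = value  # cov(a, b) == cov(b, a), so mirror it
    return matrix, calls


data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
matrix, calls = symmetric_covariance_matrix(data)
```

For n columns this makes n * (n + 1) / 2 covariance calls instead of n * n.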

Authors:
  - Mayank Anand (https://github.com/mayankanand007)
  - Michael Wang (https://github.com/isVoid)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
@vyasr
Contributor

vyasr commented Jul 11, 2022

This was closed in #9889

@vyasr vyasr closed this as completed Jul 11, 2022

3 participants