[FEA] libcuDF covariance for Series and Groupby #1268

beckernick · 2019-03-22T22:22:51Z

Is your feature request related to a problem? Please describe.
As a cuDF user, I want to calculate the covariance matrix of my dataset. While I am interested in the covariance between two series, many downstream uses need the entire covariance matrix rather than just the covariance of two columns. Implementing covariance in libcudf at the gdf_column level will let us build the DataFrame-level API in the Python bindings.

The pandas API docs are here.

Describe the solution you'd like
I'd like to be able to call df.cov() and get a pairwise covariance matrix of my dataset. To do this, we'll need a method that can compute the covariance of two gdf columns.

Describe alternatives you've considered
The alternative is to manually calculate the covariance, most easily in a double for loop.
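For reference, the manual workaround described above might look roughly like the following. This is only a sketch in plain Python, not cuDF code: the helper names `covariance` and `covariance_matrix` are hypothetical, columns are assumed to be equal-length lists of numbers with no nulls, and `ddof=1` (sample covariance) is assumed as the default.

```python
def covariance(x, y, ddof=1):
    """Covariance of two equal-length numeric sequences (hypothetical helper)."""
    n = len(x)
    if n != len(y):
        raise ValueError("columns must have the same length")
    if n - ddof <= 0:
        raise ValueError("not enough rows for the requested ddof")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means, scaled by (n - ddof).
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - ddof)


def covariance_matrix(columns, ddof=1):
    """Pairwise covariance of every column pair using a double for loop.

    `columns` maps a column name to a list of numbers.
    The result is a nested dict: matrix[row_name][col_name].
    """
    matrix = {}
    for name_a, col_a in columns.items():
        matrix[name_a] = {}
        for name_b, col_b in columns.items():
            matrix[name_a][name_b] = covariance(col_a, col_b, ddof)
    return matrix


data = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]}
cov = covariance_matrix(data)
```

Every pair is computed twice here (once as `(a, b)` and once as `(b, a)`), which is one reason a native pairwise implementation is preferable.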

Additional context


beckernick · 2019-10-03T21:40:40Z

#2719 provided a temporary, Python-based implementation for Series.covariance. Updating this issue to refer to a libcuDF implementation.

Add sort-groupby covariance and Pearson correlation in libcudf

Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
Depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as two child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`).

```
covariance = Sum((x-mean_x)*(y-mean_y)) / (group_size-ddof)
Pearson correlation = covariance / xstddev / ystddev
```

The x and y values should both be non-null. mean, stddev, and count should be calculated only on the common non-null values of both columns. The mean, stddev, and count of the child columns are cached.

One limitation: when the two columns have non-identical null masks, the cached result (mean, stddev, count) of the common valid rows cannot be reused, because the null mask produced by bitmask_and goes out of scope and a new null mask is created for another set of columns (even if they are the same).

Unit tests for covariance and Pearson correlation added.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/nvdbaranec

URL: #9154
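To make the formulas in the commit message above concrete, here is a small plain-Python sketch of per-group covariance and Pearson correlation restricted to rows where both values are non-null. It is not the libcudf implementation: the function name `group_cov_corr` is hypothetical, nulls are modelled as `None`, and the standard deviations are assumed to use the same `ddof` as the covariance (so the factor cancels in the correlation).

```python
import math


def group_cov_corr(keys, x, y, ddof=1):
    """Return {key: (covariance, pearson_correlation)} per group.

    Only rows where both x and y are non-null (not None) contribute,
    mirroring the "common non-null values" rule in the description.
    """
    groups = {}
    for key, xv, yv in zip(keys, x, y):
        if xv is None or yv is None:
            continue  # skip rows that are null in either column
        groups.setdefault(key, []).append((xv, yv))

    result = {}
    for key, pairs in groups.items():
        n = len(pairs)  # group_size counted over common valid rows
        if n - ddof <= 0:
            result[key] = (None, None)  # undefined for this group
            continue
        mean_x = sum(p[0] for p in pairs) / n
        mean_y = sum(p[1] for p in pairs) / n
        sxy = sum((px - mean_x) * (py - mean_y) for px, py in pairs)
        sxx = sum((px - mean_x) ** 2 for px, _ in pairs)
        syy = sum((py - mean_y) ** 2 for _, py in pairs)
        cov = sxy / (n - ddof)
        std_x = math.sqrt(sxx / (n - ddof))
        std_y = math.sqrt(syy / (n - ddof))
        # Correlation is undefined when either column is constant.
        corr = cov / std_x / std_y if std_x > 0 and std_y > 0 else None
        result[key] = (cov, corr)
    return result


keys = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 1.0, None, 3.0]
y = [2.0, 4.0, 6.0, 3.0, 5.0, 1.0]
stats = group_cov_corr(keys, x, y)
```

In this example the middle row of group `b` is dropped because `x` is null there, so that group's statistics are computed from two rows only. Groups with no rows valid in both columns do not appear in the result of this sketch.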

This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268.

Related issue: #1268
Related PRs: #9154, #9166, #9492

Next steps:
- [ ] Fix symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
- Mayank Anand (https://github.com/mayankanand007)
- Michael Wang (https://github.com/isVoid)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889

vyasr · 2022-07-11T23:28:01Z

This was closed in #9889

beckernick added Needs Triage (Need team to review and classify) and feature request (New feature or request) labels Mar 22, 2019

beckernick changed the title ~~[FEA] DataFrame level covariance~~ [FEA] Series level covariance Mar 22, 2019

kkraus14 added Python (Affects Python cuDF API.) and libcudf (Affects libcudf (C++/CUDA) code.) and removed Needs Triage (Need team to review and classify) labels Mar 27, 2019

beckernick changed the title ~~[FEA] Series level covariance~~ [FEA] libcuDF covariance for Series and Groupby Oct 3, 2019

beckernick added this to the Pandas API Alignment and Coverage milestone Jul 23, 2021

karthikeyann mentioned this issue Oct 4, 2021

Add Covariance, Pearson correlation for sort groupby (libcudf) #9154

Merged

mayankanand007 mentioned this issue Dec 10, 2021

Add covariance for sort groupby (python) #9889

Merged

5 tasks

vyasr closed this as completed Jul 11, 2022
