[FEA] libcuDF covariance for Series and Groupby #1268
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Python
Affects Python cuDF API.
beckernick added the labels Needs Triage (Need team to review and classify) and feature request (New feature or request) on Mar 22, 2019
beckernick changed the title from "[FEA] DataFrame level covariance" to "[FEA] Series level covariance" on Mar 22, 2019
kkraus14 added the labels Python (Affects Python cuDF API.) and libcudf (Affects libcudf (C++/CUDA) code.) and removed the label Needs Triage (Need team to review and classify) on Mar 27, 2019
beckernick changed the title from "[FEA] Series level covariance" to "[FEA] libcuDF covariance for Series and Groupby" on Oct 3, 2019
#2719 provided a temporary, Python-based implementation for Series.covariance. Updating this issue to refer to a libcuDF implementation.
rapids-bot bot pushed a commit that referenced this issue on Oct 18, 2021

Add sort-groupby covariance and Pearson correlation in libcudf

Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
Depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as 2 child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`).

```
covariance = Sum((x-mean_x)*(y-mean_y)) / (group_size-ddof)
Pearson correlation = covariance / xstddev / ystddev
```

The x and y values should both be non-null. mean, stddev, and count should be calculated on only the common non-null values of both columns. The mean, stddev, and count of the child columns are cached. One limitation: when both columns are nullable and have non-identical null masks, the cached result (mean, stddev, count) for the common valid rows cannot be reused, because the null mask produced by bitmask_and goes out of scope and a new null mask is created for another set of columns (even if they are the same).

Unit tests for covariance and Pearson correlation added.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/nvdbaranec

URL: #9154
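For readers who want to check the formula above by hand, here is a minimal pure-Python sketch of a grouped covariance. It is not the libcudf implementation: the function name `group_covariance`, the list-based inputs, and the use of `None` to stand in for a null are all assumptions made for illustration. It keeps only rows where both x and y are non-null, as the commit message describes, and returns `None` for a group when `group_size - ddof` is not positive.

```python
from collections import defaultdict


def group_covariance(keys, xs, ys, ddof=1):
    """Per-group covariance of (x, y); hypothetical helper, not a cuDF API.

    Rows where either x or y is None are skipped, so the mean and count
    are computed on the common non-null rows only.
    """
    groups = defaultdict(list)
    for key, x, y in zip(keys, xs, ys):
        if x is not None and y is not None:
            groups[key].append((x, y))

    result = {}
    for key, pairs in groups.items():
        n = len(pairs)
        if n - ddof <= 0:
            # Not enough valid rows for the requested ddof.
            result[key] = None
            continue
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        total = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        result[key] = total / (n - ddof)
    return result


if __name__ == "__main__":
    keys = ["a", "a", "a", "b", "b"]
    xs = [1.0, 2.0, 3.0, 4.0, None]
    ys = [2.0, 4.0, 6.0, 1.0, 5.0]
    print(group_covariance(keys, xs, ys))
```

A group whose rows are all null in x or y does not appear in the returned dictionary at all; a real implementation would have to decide how to report such groups.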
rapids-bot bot pushed a commit that referenced this issue on Feb 17, 2022

This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268

Related issue: #1268
Related PRs: #9154, #9166, #9492

Next steps:
- [ ] Fix Symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
- Mayank Anand (https://github.com/mayankanand007)
- Michael Wang (https://github.com/isVoid)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
This was closed in #9889
Is your feature request related to a problem? Please describe.
As a cuDF user, I want to calculate the covariance matrix of my dataset. While I am interested in the covariance between two series, many downstream uses need the entire covariance matrix rather than just the covariance of two columns. Implementing covariance in libcudf at the gdf_column level will let us create the DataFrame-level API in the Python bindings.
The pandas API docs are here.
Describe the solution you'd like
I'd like to be able to call
df.cov()
and get a pairwise covariance matrix of my dataset. To do this, we'll need a method that can compute the covariance of two gdf columns.
Describe alternatives you've considered
The alternative is to manually calculate the covariance, most easily in a double for loop.
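For concreteness, the manual approach might look roughly like the sketch below. It is plain Python over a dict of equal-length lists rather than cuDF columns, and the helper names `covariance` and `covariance_matrix` are made up for this example. It assumes no nulls and uses the sample covariance (`ddof=1`), which is also the pandas default.

```python
def covariance(xs, ys, ddof=1):
    """Covariance of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - ddof)


def covariance_matrix(columns, ddof=1):
    """Pairwise covariance matrix via a double for loop over column names."""
    names = list(columns)
    matrix = []
    for a in names:
        row = []
        for b in names:
            row.append(covariance(columns[a], columns[b], ddof))
        matrix.append(row)
    return names, matrix


if __name__ == "__main__":
    data = {"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]}
    names, matrix = covariance_matrix(data)
    print(names)
    print(matrix)
```

This visits every ordered pair, so each off-diagonal entry is computed twice. That is fine for a handful of columns but wasteful for wide datasets.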
Additional context