Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add dictionary support to libcudf groupby functions #6585

Merged
merged 61 commits into from
Jan 5, 2021

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Oct 22, 2020

Reference #5963 Add dictionary support to groupby.

  • argmax
  • argmin
  • collect
  • count
  • max
  • mean*
  • median
  • min
  • nth element
  • nunique
  • quantile
  • std*
  • sum*
  • var*
  • not supported due to 10.2 compile segfault

@davidwendt davidwendt self-assigned this Oct 22, 2020
@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Oct 22, 2020
@codecov
Copy link

codecov bot commented Oct 23, 2020

Codecov Report

Merging #6585 (6cc32d8) into branch-0.18 (31c0d29) will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #6585      +/-   ##
===============================================
+ Coverage        82.09%   82.11%   +0.02%     
===============================================
  Files               97       97              
  Lines            16474    16477       +3     
===============================================
+ Hits             13524    13530       +6     
+ Misses            2950     2947       -3     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/hash_vocab_utils.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 91.48% <0.00%> (+4.25%) ⬆️
python/cudf/cudf/utils/gpu_utils.py 58.53% <0.00%> (+4.87%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31c0d29...6cc32d8. Read the comment docs.

@jrhemstad
Copy link
Contributor

I realized that this PR should wait until #6392 as it will likely have a lot of conflicts.

@davidwendt
Copy link
Contributor Author

I cannot get around the 10.2 ptxas compile segfault as documented here https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3186317
The code logic that seems to cause the error is isolated in the aggregation sum specialization for dictionary type. This means we cannot support aggregation-sum-groupby for dictionary columns as values. And aggregation-sum is also required in order to support mean, std, and var aggregations. Using a dictionary column as keys for groupby work fine.

The PR includes logic to throw an exception if a dictionary column is used as values with one of these aggregation types. The code is isolated around a CUDA==10.2 equivalent compile directive so technically these aggregations work if compiled with 10.1 or 11.0 but the gtests have been commented out.

@davidwendt
Copy link
Contributor Author

rerun tests

1 similar comment
@mike-wendt
Copy link
Contributor

rerun tests

cpp/tests/groupby/group_sum_test.cpp Outdated Show resolved Hide resolved
cpp/tests/groupby/group_var_test.cpp Outdated Show resolved Hide resolved
cpp/tests/groupby/group_sum_test.cpp Show resolved Hide resolved
cpp/tests/groupby/group_nth_element_test.cpp Outdated Show resolved Hide resolved
cpp/tests/groupby/group_mean_test.cpp Outdated Show resolved Hide resolved
cpp/tests/groupby/group_mean_test.cpp Show resolved Hide resolved
cpp/src/groupby/sort/group_std.cu Show resolved Hide resolved
@harrism harrism added 6 - Okay to Auto-Merge 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jan 5, 2021
@rapids-bot rapids-bot bot merged commit 6828e2c into rapidsai:branch-0.18 Jan 5, 2021
Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn. I missed it by a minute.


// prevent divide by zero error
if (group_size == 0 or group_size - ddof <= 0) return 0.0;

ResultType mean = d_means.element<ResultType>(group_idx);
ResultType mean = d_means[group_idx];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sort of the reverse direction of what I've been doing. column_device_view's element() accessor is more helpful than simple subscript. e.g. in fixed point. element() will return the value with the scale applied. And that scale is stored once in the column.

Any particular reason for this change?

@davidwendt davidwendt deleted the dictionary-groupby branch January 11, 2021 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants