[REVIEW] Add dictionary support to libcudf groupby functions #6585

davidwendt · 2020-10-22T21:44:47Z

Reference #5963: add dictionary support to groupby.

Dictionary columns as values are not supported for the sum, mean, std, and var aggregations because of a CUDA 10.2 compile segfault.

codecov · 2020-10-23T00:04:11Z

Codecov Report

Merging #6585 (6cc32d8) into branch-0.18 (31c0d29) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.18    #6585      +/-   ##
===============================================
+ Coverage        82.09%   82.11%   +0.02%     
===============================================
  Files               97       97              
  Lines            16474    16477       +3     
===============================================
+ Hits             13524    13530       +6     
+ Misses            2950     2947       -3

Impacted Files	Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/hash_vocab_utils.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/core/abc.py	`91.48% <0.00%> (+4.25%)`	⬆️
python/cudf/cudf/utils/gpu_utils.py	`58.53% <0.00%> (+4.87%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jrhemstad · 2020-10-26T22:03:26Z

I realized that this PR should wait until #6392 is merged, as the two will likely have a lot of conflicts.

davidwendt · 2020-12-15T13:42:51Z

I cannot get around the CUDA 10.2 ptxas compile segfault documented here: https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=3186317
The code that seems to cause the error is isolated in the aggregation sum specialization for the dictionary type. This means we cannot support sum aggregation in groupby when the values are a dictionary column. Sum aggregation is also required for the mean, std, and var aggregations, so those are affected too. Using a dictionary column as the keys for groupby works fine.

The PR includes logic to throw an exception if a dictionary column is used as values with one of these aggregation types. That code is guarded by a compile directive that checks for CUDA 10.2, so technically these aggregations work when compiled with 10.1 or 11.0, but the corresponding gtests have been commented out.
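
The guard described above could look roughly like the following host-side sketch. It is not the code from this PR: `agg_kind`, `depends_on_sum`, `built_with_cuda_10_2`, and `check_dictionary_values` are hypothetical names made up for illustration, and the exact directive used in the PR is not shown in this thread. The only real identifiers are the `__CUDACC_VER_MAJOR__` and `__CUDACC_VER_MINOR__` macros that nvcc defines.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the aggregation kinds; not the libcudf type.
enum class agg_kind { SUM, MEAN, STD, VAR, MIN, MAX, COUNT };

// True only when this translation unit is compiled by the CUDA 10.2 nvcc.
// A plain host compiler leaves these macros undefined, so the flag is false.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ == 10) && \
  (__CUDACC_VER_MINOR__ == 2)
constexpr bool built_with_cuda_10_2 = true;
#else
constexpr bool built_with_cuda_10_2 = false;
#endif

// mean, std, and var are all computed from sum, so they share the restriction.
constexpr bool depends_on_sum(agg_kind kind)
{
  return kind == agg_kind::SUM || kind == agg_kind::MEAN ||
         kind == agg_kind::STD || kind == agg_kind::VAR;
}

// Reject the unsupported combination up front instead of failing later.
// The last parameter exists only so the check can be exercised on any toolkit.
inline void check_dictionary_values(agg_kind kind,
                                    bool values_are_dictionary,
                                    bool affected_toolkit = built_with_cuda_10_2)
{
  if (affected_toolkit && values_are_dictionary && depends_on_sum(kind)) {
    throw std::logic_error(
      "dictionary values are not supported for this aggregation with CUDA 10.2");
  }
}
```

Aggregations that do not depend on sum (min, max, count, and so on) pass the check regardless of toolkit, which matches the statement that only sum, mean, std, and var are restricted.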

davidwendt · 2020-12-16T13:41:29Z

rerun tests

mike-wendt · 2020-12-16T19:48:00Z

rerun tests

devavret

Damn. I missed it by a minute.

devavret · 2021-01-05T21:09:20Z

cpp/src/groupby/sort/group_std.cu


    // prevent divide by zero error
    if (group_size == 0 or group_size - ddof <= 0) return 0.0;

-    ResultType mean = d_means.element<ResultType>(group_idx);
+    ResultType mean = d_means[group_idx];


This is roughly the reverse of the direction I've been taking. `column_device_view`'s `element()` accessor is more helpful than a plain subscript. For example, with fixed-point types `element()` returns the value with the scale applied, and that scale is stored once in the column.
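
To illustrate the difference, here is a minimal host-only sketch. It does not use the libcudf API: `fixed_point_column_sketch` is a hypothetical type, and its non-template `element()` is a simplification of the real templated accessor. It only shows why an accessor that knows the column's scale differs from a raw subscript.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-point column: unscaled integers plus one shared
// base-10 scale stored once for the whole column.
struct fixed_point_column_sketch {
  std::vector<std::int32_t> rep;  // stored integer representation
  int scale;                      // logical value = rep * 10^scale

  // Plain subscript: returns the stored integer and ignores the scale.
  std::int32_t operator[](std::size_t i) const { return rep[i]; }

  // Accessor: returns the logical value with the column's scale applied.
  double element(std::size_t i) const
  {
    return static_cast<double>(rep[i]) * std::pow(10.0, scale);
  }
};
```

With `rep = {1234, 50}` and `scale = -2`, the subscript yields the raw integers 1234 and 50, while `element()` yields approximately 12.34 and 0.5.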

Any particular reason for this change?

davidwendt added 4 commits October 22, 2020 09:33

tracing argmax

4f8306a

Merge branch 'branch-0.17' into dictionary-groupby

31d7023

testing dictionary in groupby

489cfc4

update changelog

7c13b2f

davidwendt self-assigned this Oct 22, 2020

davidwendt added the "2 - In Progress" (Currently a work in progress) and "libcudf" (Affects libcudf (C++/CUDA) code) labels Oct 22, 2020

jrhemstad reviewed Oct 23, 2020

cpp/src/groupby/hash/groupby.cu (outdated, resolved)

davidwendt added 4 commits October 23, 2020 11:29

added count, max, min, nunique

27f1737

Merge branch 'branch-0.17' into dictionary-groupby

90ecc7d

first-pass groupby-sum with dictionary col

e42f02b

Merge branch 'branch-0.17' into dictionary-groupby

bdf0395

davidwendt added 15 commits October 27, 2020 17:12

dictionary support in groupby:sum

153ad3b

dictionary support in groupby:quantile

bdcb321

add dictionary gtests for groupby:mean

13303cd

add dictionary gtests for groupby:median

a1d4eb3

Merge branch 'branch-0.17' into dictionary-groupby

7e8b3dd

dictionary support in groupby:std and groupby:var

97031b7

Merge branch 'branch-0.17' into dictionary-groupby

de61afa

Merge branch 'branch-0.17' into dictionary-groupby

8444c5b

dictionary gtest for groupby:nth-element

6fec233

change type.id()==DICTIONARY32 to is_dictionary(type)

afefa66

Merge branch 'branch-0.17' into dictionary-groupby

08d324b

create/use make_dictionary_pair_iterator

10c7528

add dictionary test to groupby collect gtest

634c32e

Merge branch 'branch-0.17' into dictionary-groupby

a13280e

Merge branch 'branch-0.17' into dictionary-groupby

5cf5eed

davidwendt added 4 commits December 14, 2020 12:46

add keys() and indices() accessors for dictionary_column_wrapper

fb1ef2d

fix check for 10.2 in groupby api

83bf836

add dictionary to keys tests

8957121

remove keys tests for dictionary columns

5b16848

davidwendt requested a review from karthikeyann December 14, 2020 17:59

davidwendt added 2 commits December 15, 2020 08:27

remove unneeded include from tests source files

2cc5f62

Merge branch 'branch-0.18' into dictionary-groupby

1c297f5

jrhemstad reviewed Dec 15, 2020

cpp/src/groupby/sort/group_quantiles.cu (resolved)

Merge branch 'branch-0.18' into dictionary-groupby

0e016e1

Merge branch 'branch-0.18' into dictionary-groupby

2715cd1

karthikeyann requested changes Jan 5, 2021

Merge branch 'branch-0.18' into dictionary-groupby

fc953a3

jrhemstad reviewed Jan 5, 2021

cpp/src/groupby/hash/groupby.cu (resolved)

jrhemstad approved these changes Jan 5, 2021

changed commented out gtests to DISABLED tests

6cc32d8

davidwendt requested a review from karthikeyann January 5, 2021 15:35

karthikeyann approved these changes Jan 5, 2021

harrism added the "6 - Okay to Auto-Merge" and "5 - Ready to Merge" (Testing and reviews complete, ready to merge) labels and removed the "3 - Ready for Review" (Ready for review by team) label Jan 5, 2021

rapids-bot bot merged commit 6828e2c into rapidsai:branch-0.18 Jan 5, 2021

devavret reviewed Jan 5, 2021

davidwendt deleted the dictionary-groupby branch January 11, 2021 13:31

ttnghia mentioned this pull request Mar 18, 2021

Implement groupby collect_set #7420

Merged
