Implement groupby MERGE_LISTS and MERGE_SETS aggregates #8436

Merged
merged 25 commits into from
Jun 22, 2021


ttnghia
Contributor

@ttnghia ttnghia commented Jun 3, 2021

Groupby aggregations can be run in a distributed setting using the following approach:

  • Divide the dataset into batches
  • Run separate (distributed) aggregations over those batches on the distributed nodes
  • Merge the results of the step above into one final result by calling groupby::aggregate a final time on the master node

This PR adds the merge operations (MERGE_LISTS and MERGE_SETS) for the lists that result from distributed collect_list and collect_set aggregations.
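The three steps above can be illustrated with a minimal pure-Python sketch. It is not the libcudf API: the function names `collect_list`, `merge_lists`, and `merge_sets` are hypothetical stand-ins, plain dictionaries replace list columns, and sorting the merged sets is only an assumption made here to keep the output deterministic.

```python
from collections import defaultdict


def collect_list(keys, values):
    """Per-batch aggregation: gather the values of each key into a list."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


def merge_lists(partials):
    """Final step: concatenate the per-batch lists that share a key."""
    merged = defaultdict(list)
    for partial in partials:
        for key, items in partial.items():
            merged[key].extend(items)
    return dict(merged)


def merge_sets(partials):
    """Final step: like merge_lists, but duplicates are dropped per key.

    Sorting is an assumption of this sketch, used only for stable output.
    """
    return {key: sorted(set(items)) for key, items in merge_lists(partials).items()}


# Step 1: the dataset is divided into batches of (keys, values).
batches = [
    ([1, 1, 2], [10, 20, 30]),
    ([1, 2, 2], [20, 40, 30]),
]

# Step 2: each batch is aggregated independently (on separate nodes in practice).
partials = [collect_list(keys, values) for keys, values in batches]

# Step 3: the partial results are merged into one final result.
merged_lists = merge_lists(partials)
merged_sets = merge_sets(partials)
```

In this sketch the merged lists keep duplicates that appear across batches, while the merged sets keep each value once per key.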

Closes #7839.

@ttnghia ttnghia self-assigned this Jun 3, 2021
@ttnghia ttnghia requested review from a team as code owners June 3, 2021 20:12
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jun 3, 2021
@ttnghia ttnghia marked this pull request as draft June 3, 2021 20:14
@sperlingxx
Contributor

LGTM, so far.

@codecov

codecov bot commented Jun 8, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@8fb1c42). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head cef9180 differs from pull request most recent head 078e90e. Consider uploading reports for the commit 078e90e to get more accurate results

@@               Coverage Diff               @@
##             branch-21.08    #8436   +/-   ##
===============================================
  Coverage                ?   82.59%           
===============================================
  Files                   ?      109           
  Lines                   ?    17858           
  Branches                ?        0           
===============================================
  Hits                    ?    14750           
  Misses                  ?     3108           
  Partials                ?        0           


@ttnghia ttnghia added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Jun 8, 2021
@ttnghia ttnghia requested review from a team as code owners June 15, 2021 19:29
@ttnghia ttnghia requested review from harrism and removed request for a team, isVoid and charlesbluca June 15, 2021 19:30
@ttnghia
Contributor Author

ttnghia commented Jun 15, 2021

Rerun tests.

@ttnghia
Contributor Author

ttnghia commented Jun 16, 2021

Rerun tests.

@ttnghia ttnghia added the 0 - Blocked Cannot progress due to external reasons label Jun 17, 2021
@ttnghia ttnghia removed the 0 - Blocked Cannot progress due to external reasons label Jun 17, 2021
@harrism
Member

harrism commented Jun 22, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a9a95f3 into rapidsai:branch-21.08 Jun 22, 2021
@ttnghia ttnghia deleted the groupby_merge_lists branch June 22, 2021 14:11
rapids-bot bot pushed a commit that referenced this pull request Jun 24, 2021
Closes #8445

This PR provides Java bindings for #8436.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #8516
Labels
3 - Ready for Review (Ready for review by team); CMake (CMake build issue); feature request (New feature or request); libcudf (Affects libcudf (C++/CUDA) code.); non-breaking (Non-breaking change); Spark (Functionality that helps Spark RAPIDS)
Development

Successfully merging this pull request may close these issues.

[FEA] Support MERGE_LISTS and MERGE_SETS for groupby::aggregate
4 participants