Use cudf::lists::distinct in Python binding #11234

Merged
merged 38 commits into from
Jul 13, 2022

@ttnghia ttnghia commented Jul 8, 2022

The Python binding provides the `lists.unique()` API, which extracts the unique elements of each list in the input lists column. Previously, it was implemented by calling `cudf::lists::drop_list_duplicates`, which performs a segmented sort on the input lists and then extracts the unique list elements.

This PR changes the implementation of `lists.unique()` to use `cudf::lists::distinct`, which can improve performance by finding distinct elements with a hash table instead of a segmented sort.
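To illustrate the difference between the two strategies, here is a minimal pure-Python sketch. It is not the libcudf implementation: the function names `unique_by_sort` and `unique_by_hash` are hypothetical, and the per-row loops only stand in for what the GPU code does across all lists at once.

```python
from typing import Hashable, List, Sequence


def unique_by_sort(lists: Sequence[Sequence[Hashable]]) -> List[list]:
    """Sort-based approach: sort each list, then drop adjacent duplicates."""
    result = []
    for row in lists:
        ordered = sorted(row)  # O(n log n) per list
        deduped = [
            x for i, x in enumerate(ordered) if i == 0 or x != ordered[i - 1]
        ]
        result.append(deduped)
    return result


def unique_by_hash(lists: Sequence[Sequence[Hashable]]) -> List[list]:
    """Hash-based approach: keep an element only if it has not been seen yet."""
    result = []
    for row in lists:
        seen = set()  # hash set, expected O(1) lookups, no sort needed
        deduped = []
        for x in row:
            if x not in seen:
                seen.add(x)
                deduped.append(x)
        result.append(deduped)
    return result


# Hypothetical sample data: three lists, one of them empty.
rows = [[3, 1, 3, 2, 1], [], [5, 5, 5]]
sorted_result = unique_by_sort(rows)
hashed_result = unique_by_hash(rows)
```

Both functions return the same set of elements for every list, but the element order differs: the sort-based version yields sorted output, while this hash-based sketch happens to keep first-occurrence order. Callers should not assume any particular element order from the real `lists.unique()` unless the cuDF documentation guarantees one.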

ttnghia commented Jul 11, 2022

Rerun tests.

@ttnghia ttnghia force-pushed the use_lists_distinct_in_python branch from 9367c14 to 7323122 Compare July 11, 2022 19:54

@github-actions github-actions bot removed CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jul 12, 2022
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress 0 - Blocked Cannot progress due to external reasons labels Jul 12, 2022
@ttnghia ttnghia marked this pull request as ready for review July 12, 2022 04:37
@ttnghia ttnghia requested a review from a team as a code owner July 12, 2022 04:37
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 13, 2022
@galipremsagar

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 62c0ae8 into rapidsai:branch-22.08 Jul 13, 2022
@ttnghia ttnghia deleted the use_lists_distinct_in_python branch July 19, 2022 02:41
rapids-bot bot pushed a commit that referenced this pull request Jul 22, 2022
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::lists::distinct`, which has a simpler implementation and better performance. All internal cudf usages were already replaced in earlier PRs, so this work has no side effects and does not break any existing APIs.

Closes #11114, #11093, #11053, #11034, and closes #9257.

Depends on:
 * #11228
 * #11149
 * #11234
 * #11233

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #11236
Labels: 5 - Ready to Merge, improvement, non-breaking, Performance, Python