Implement drop_list_duplicates #7528

Merged
merged 25 commits into from
Mar 12, 2021

Conversation

ttnghia
Contributor

@ttnghia ttnghia commented Mar 7, 2021

Closes #7494 and partially addresses #7414.

This is the new implementation of drop_list_duplicates, which removes duplicate entries from each list in a lists column. The result is a new lists column in which every list row contains only unique entries. In the current implementation, the entries of each output list are sorted in ascending order, with null(s) last.

Example with null_equality=EQUAL:

input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }

Example with null_equality=UNEQUAL:

input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }
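
The semantics shown in the two examples can be modeled on the host with a short C++17 sketch. This is not the libcudf implementation, which operates on device columns; the type aliases and the function names `drop_duplicates_in_list` and `drop_list_duplicates_model` are hypothetical, and sort-then-unique is just one simple way to reproduce the documented output order (ascending, nulls last).

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative host-side stand-ins, not libcudf types.
using element  = std::optional<int>;                   // std::nullopt models a null entry
using list_row = std::optional<std::vector<element>>;  // std::nullopt models a null list row

enum class null_equality { EQUAL, UNEQUAL };

// Deduplicate one list: non-null values sorted ascending and made unique,
// followed by the nulls (one null if nulls compare equal, all of them otherwise).
std::vector<element> drop_duplicates_in_list(std::vector<element> const& entries,
                                             null_equality nulls_equal)
{
  std::vector<int> values;
  std::size_t null_count = 0;
  for (auto const& e : entries) {
    if (e.has_value()) {
      values.push_back(*e);
    } else {
      ++null_count;
    }
  }

  std::sort(values.begin(), values.end());
  values.erase(std::unique(values.begin(), values.end()), values.end());

  if (nulls_equal == null_equality::EQUAL) { null_count = std::min<std::size_t>(null_count, 1); }

  std::vector<element> result(values.begin(), values.end());
  result.insert(result.end(), null_count, element{});  // nulls go last
  return result;
}

// Apply the per-list step to every row; null rows and empty rows pass through unchanged.
std::vector<list_row> drop_list_duplicates_model(std::vector<list_row> const& lists,
                                                 null_equality nulls_equal)
{
  std::vector<list_row> out;
  out.reserve(lists.size());
  for (auto const& row : lists) {
    if (row.has_value()) {
      out.emplace_back(drop_duplicates_in_list(*row, nulls_equal));
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

Feeding the example input through this model with `null_equality::EQUAL` collapses the three nulls in the last row into one, while `null_equality::UNEQUAL` keeps all three.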

@ttnghia ttnghia requested review from a team as code owners March 7, 2021 22:05
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Mar 7, 2021
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change and removed CMake CMake build issue labels Mar 7, 2021
@ttnghia ttnghia requested a review from a team as a code owner March 7, 2021 22:26
@github-actions github-actions bot added CMake CMake build issue conda labels Mar 7, 2021
@ttnghia ttnghia removed the CMake CMake build issue label Mar 7, 2021
@github-actions github-actions bot added the CMake CMake build issue label Mar 8, 2021
@codecov

codecov bot commented Mar 8, 2021

Codecov Report

Merging #7528 (ae68606) into branch-0.19 (f4f4d87) will increase coverage by 0.52%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.19    #7528      +/-   ##
===============================================
+ Coverage        81.85%   82.38%   +0.52%     
===============================================
  Files              101      101              
  Lines            16883    17339     +456     
===============================================
+ Hits             13819    14284     +465     
+ Misses            3064     3055       -9     
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/decimal.py | `93.33% <0.00%> (-1.54%)` | ⬇️ |
| python/cudf/cudf/core/abc.py | `87.23% <0.00%> (-1.14%)` | ⬇️ |
| python/cudf/cudf/core/column/numerical.py | `94.85% <0.00%> (-0.17%)` | ⬇️ |
| python/cudf/cudf/io/feather.py | `100.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/_compat.py | `100.00% <0.00%> (ø)` | |
| python/cudf/cudf/comm/serialize.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_fuzz_testing/io.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/column/struct.py | `100.00% <0.00%> (ø)` | |
| python/dask_cudf/dask_cudf/_version.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_fuzz_testing/fuzzer.py | `0.00% <0.00%> (ø)` | |
| ... and 43 more | | |


rapids-bot bot pushed a commit that referenced this pull request Mar 10, 2021
#7551)

Fix the `offset_end` iterator in `lists_column_view`. Because the offsets column has one more element than the column has rows, `offset_end` should be computed as `offset_begin() + size() + 1`. This can also be written as `offset_begin() + offsets().size()`.
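
The off-by-one is easier to see on a toy layout. The sketch below is a host-side illustration only: `toy_lists_view` is a hypothetical struct that borrows the method names from the commit message, not the real `lists_column_view` class, and it assumes the offsets array always holds at least one entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a lists column: row i spans the child elements
// [offsets[i], offsets[i + 1]), so N rows need N + 1 offsets.
struct toy_lists_view {
  std::vector<std::int32_t> offsets;  // assumed non-empty: always starts with 0

  // Number of list rows, which is one fewer than the number of offsets.
  std::size_t size() const { return offsets.size() - 1; }

  std::int32_t const* offset_begin() const { return offsets.data(); }

  // Past-the-end of the offsets: the range must cover all size() + 1 entries.
  // Using offset_begin() + size() here would silently drop the final offset.
  std::int32_t const* offset_end() const { return offset_begin() + size() + 1; }
};
```

For the five-row example column from the description of #7528, the offsets would be `{0, 5, 6, 6, 6, 14}`: six entries for five rows, with the null and empty rows contributing zero-length ranges.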

This PR blocks #7528, so it must be merged before that PR.

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - Jake Hemstad (@jrhemstad)
  - Mike Wilson (@hyperbolic2346)
  - Vukasin Milovanovic (@vuule)

URL: #7551
@ttnghia ttnghia requested a review from davidwendt March 12, 2021 13:23
@ttnghia
Contributor Author

ttnghia commented Mar 12, 2021

@gpucibot merge

@ttnghia ttnghia removed the CMake CMake build issue label Mar 12, 2021
@rapids-bot rapids-bot bot merged commit 8aeb14e into rapidsai:branch-0.19 Mar 12, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 23, 2021
This partially addresses #2973.

This PR implements the groupby `collect_set` aggregation. It simply applies `drop_list_duplicates` (#7528) to the result of groupby `collect_list`, producing collected lists without duplicate entries.

Examples:
```
keys = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
vals = {10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31};

keys_output = {1, 2, 3};
vals_output = {{10, 11}, {20, 21}, {30, 31, 32, 33}};
```
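
The composition described in this commit message (collect, then deduplicate) can be sketched on the host as follows. This is a simplified model assuming integer keys and values without nulls; the function names `collect_list_model` and `collect_set_model` are hypothetical and do not correspond to the libcudf groupby API.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// collect_list: gather every value under its key, keeping duplicates.
std::map<int, std::vector<int>> collect_list_model(std::vector<int> const& keys,
                                                   std::vector<int> const& vals)
{
  std::map<int, std::vector<int>> groups;
  for (std::size_t i = 0; i < keys.size() && i < vals.size(); ++i) {
    groups[keys[i]].push_back(vals[i]);
  }
  return groups;
}

// collect_set: run collect_list, then drop duplicates within each collected list.
// Sorting before std::unique mirrors the sorted output shown in the example.
std::map<int, std::vector<int>> collect_set_model(std::vector<int> const& keys,
                                                  std::vector<int> const& vals)
{
  auto groups = collect_list_model(keys, vals);
  for (auto& entry : groups) {
    auto& list = entry.second;
    std::sort(list.begin(), list.end());
    list.erase(std::unique(list.begin(), list.end()), list.end());
  }
  return groups;
}
```

With the `keys` and `vals` from the example, the model yields one sorted, duplicate-free list per key, matching `vals_output`.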

In this PR, a simple, incomplete Python binding for `collect_set` has been added, and no Java binding is implemented yet. Complete Python and Java bindings need to be implemented later in separate PRs.

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - Karthikeyan (@karthikeyann)
  - Keith Kraus (@kkraus14)
  - Jason Lowe (@jlowe)
  - Ashwin Srinath (@shwina)

URL: #7420
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - @nvdbaranec
  - David (@davidwendt)
  - Keith Kraus (@kkraus14)

URL: rapidsai#7528
@ttnghia ttnghia self-assigned this Apr 25, 2021
@ttnghia ttnghia deleted the drop_list_duplicates branch May 3, 2021 21:52