Reimplement `lists::drop_list_duplicates` for keys-values lists columns #9345

ttnghia · 2021-09-29T22:37:10Z

This PR changes the interface of `lists::drop_list_duplicates` so that it accepts an optional second input, a values lists column, and returns a pair of lists columns containing copies of the input columns without duplicate entries.

If the optional values column is given, the user is responsible for ensuring that the keys and values columns have the same number of entries in each row. Otherwise, the results are undefined.

When the key entries are copied, the corresponding value entries are copied at the same time. The parameter `duplicate_keep_option`, reused from stream compaction, specifies which of the duplicate keys will be copied.
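To make the keys-values semantics concrete, here is a minimal host-side sketch for a single list row. It is not the libcudf implementation (which runs on the GPU over whole lists columns); the names `keep_option`, `list_row`, and `drop_row_duplicates` are hypothetical, and the output order chosen here (position of the kept entry) is an assumption that libcudf may not share.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for the keep option, for illustration only.
enum class keep_option { KEEP_FIRST, KEEP_LAST };

using list_row = std::vector<int>;

// Removes duplicate keys from one list row and copies the matching values.
// Assumes keys.size() == values.size(); the PR leaves mismatched sizes undefined.
std::pair<list_row, list_row> drop_row_duplicates(list_row const& keys,
                                                  list_row const& values,
                                                  keep_option keep)
{
  list_row out_keys;
  list_row out_values;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    bool duplicate = false;
    if (keep == keep_option::KEEP_FIRST) {
      // An entry is dropped if the same key appears earlier in the row.
      for (std::size_t j = 0; j < i; ++j) {
        if (keys[j] == keys[i]) { duplicate = true; }
      }
    } else {
      // An entry is dropped if the same key appears later in the row.
      for (std::size_t j = i + 1; j < keys.size(); ++j) {
        if (keys[j] == keys[i]) { duplicate = true; }
      }
    }
    if (!duplicate) {
      // The value at the same position travels with its key.
      out_keys.push_back(keys[i]);
      out_values.push_back(values[i]);
    }
  }
  return {out_keys, out_values};
}
```

For keys `{1, 2, 1, 3, 2}` with values `{10, 20, 30, 40, 50}`, `KEEP_FIRST` keeps the pairs `(1, 10)`, `(2, 20)`, `(3, 40)`, while `KEEP_LAST` keeps `(1, 30)`, `(3, 40)`, `(2, 50)`.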

This closes #9124 and is blocked by #9425.

cpp/include/cudf/lists/drop_list_duplicates.hpp

cpp/include/cudf/stream_compaction.hpp

Currently, the stream compaction API `drop_duplicates` uses unstable sort for all of its internal sorting. This is wrong, since we may want stable sorting results so that we can choose to keep the first or the last duplicate element from the repeated sequences. This PR does two things:

* Fixes the issue mentioned above by using stable sort if the input option is to keep the first or last duplicate element, and
* Adds a new keep option to the enum class `duplicate_keep_option`: `KEEP_ANY`. This option allows the user to keep one element from the repeated sequence at any position.

Note that the issue was not exposed by any failing test because thrust's default (unstable) sort, which is called internally in `drop_duplicates`, still produces the same results as thrust stable sort most of the time (but this is not guaranteed). As such, the current `drop_duplicates` still produces correct results in its tests.

This PR blocks #9345.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Bradley Dice (https://github.com/bdice)
- MithunR (https://github.com/mythrocks)

URL: #9417
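The reason keep-first and keep-last need a stable sort can be sketched on the host with the standard library. This is only an illustration of the idea, not the thrust-based code in libcudf, and `kept_indices` is a hypothetical helper name: after a stable sort by key, each run of equal keys preserves the original order, so the first and last element of a run are the first and last occurrences in the input.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the original indices of the entries kept after de-duplication.
// keep_first == true keeps the first occurrence of each key, otherwise the last.
std::vector<std::size_t> kept_indices(std::vector<int> const& keys, bool keep_first)
{
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});

  // Stable sort: within a run of equal keys, original indices stay ascending.
  // With an unstable sort that ordering inside a run would not be guaranteed.
  std::stable_sort(order.begin(), order.end(),
                   [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < order.size(); ++i) {
    bool const run_start = (i == 0) || (keys[order[i - 1]] != keys[order[i]]);
    bool const run_end   = (i + 1 == order.size()) || (keys[order[i + 1]] != keys[order[i]]);
    if (keep_first ? run_start : run_end) { result.push_back(order[i]); }
  }
  return result;
}
```

With keys `{3, 1, 3, 2, 1}`, keeping the first occurrence yields indices `{1, 3, 0}` and keeping the last yields `{4, 3, 2}`. A `KEEP_ANY` style option does not depend on the order inside a run, which is why it can use an unstable sort.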

…ithin each row of lists column (#9425)

This PR adds `lists::stable_sort_lists`, which can sort elements within rows of a lists column using stable sort. This is necessary for implementing `lists::drop_list_duplicates` that operates on keys-values columns input when we want to remove the values corresponding to duplicate keys with the `KEEP_FIRST` or `KEEP_LAST` option.

In order to implement `lists::stable_sort_lists`, stable sort versions of `segmented_sorted_order` and `segmented_sort_by_key` have also been implemented, which can maintain the order of equally-compared elements within segments.

This PR blocks #9345.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- MithunR (https://github.com/mythrocks)

URL: #9425
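As a rough picture of what a stable segmented sort computes, the sketch below works on a flattened child vector plus row offsets, the usual layout of a lists column. It is a sequential host-side analogy under that assumption, not the libcudf implementation, and `segmented_stable_order` is a hypothetical name rather than the signature of `segmented_sorted_order`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// For a lists column flattened into `child` with row boundaries in `offsets`
// (row r spans [offsets[r], offsets[r + 1])), returns the permutation of child
// indices that stably sorts the elements of each row independently.
std::vector<std::size_t> segmented_stable_order(std::vector<int> const& child,
                                                std::vector<std::size_t> const& offsets)
{
  std::vector<std::size_t> order(child.size());
  std::iota(order.begin(), order.end(), std::size_t{0});

  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    auto const first = order.begin() + static_cast<std::ptrdiff_t>(offsets[row]);
    auto const last  = order.begin() + static_cast<std::ptrdiff_t>(offsets[row + 1]);
    // Equal elements keep their relative order inside the segment.
    std::stable_sort(first, last,
                     [&child](std::size_t a, std::size_t b) { return child[a] < child[b]; });
  }
  return order;
}
```

For the two rows `[2, 1, 2, 1]` and `[5, 5, 4]` (child `{2, 1, 2, 1, 5, 5, 4}`, offsets `{0, 4, 7}`), the resulting order is `{1, 3, 0, 2, 6, 4, 5}`: equal elements appear in their original relative order, and no index crosses a row boundary.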

# Conflicts: # cpp/include/cudf/stream_compaction.hpp # cpp/src/groupby/sort/aggregate.cpp

cpp/include/cudf/lists/drop_list_duplicates.hpp

cpp/src/lists/drop_list_duplicates.cu

jrhemstad · 2021-11-10T16:26:35Z

cpp/src/lists/drop_list_duplicates.cu

  }
 };

 /**
 * @brief Struct used in type_dispatcher for comparing two entries in a lists column.
 */
 struct column_row_comparator_dispatch {
-  offset_type const* const list_offsets;
+  size_type const* const list_indices;


This looks to be redundant with element_equality_comparator. Likewise, the table_row_comparator_fn looks to be redundant with row_equality_comparator.

column_row_comparator_dispatch is not the same as element_equality_comparator. In particular, it allows the output to differ depending on the nans_equal parameter; element_equality_comparator does not offer such a feature. In addition,
Sorry that I didn't clarify it in the @brief doxygen (now added).

That's a relatively minor thing to duplicate such a large bit of complicated code. Remind me why nans_equal is necessary? For the most part, we've resigned ourselves to always treating NaNs as equal in most libcudf functions.

We had a discussion before regarding that with Keith and decided to have an option allowing 2 different behaviors because:

- Pandas wants to have NaNs treated as equal
- Spark wants to have NaNs considered unequal

> Spark wants to have NaNs considered unequal

Seriously? Most of the code in libcudf that's specialized to treat NaNs as equal was because of Spark. How wonderfully inconsistent.

However, Spark is "consistently inconsistent": in groupby it treats NaNs as equal for NaN keys (but not values). That's why we didn't need to modify the groupby code for floating-point keys.

For future work, I'd like to see a static/dynamic option added to the row/element comparators for controlling NaN equality, so we don't have to maintain this code that is 99% redundant.
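One way to picture the static/dynamic option suggested here is the pair of scalar comparators below. This is only a sketch of the idea: the names `nan_policy`, `dynamic_nan_equal`, and `static_nan_equal` are hypothetical, the code handles plain `double` values on the host, and it does not reflect the interface of libcudf's row or element comparators.

```cpp
#include <cmath>

// Hypothetical policy enum for illustration.
enum class nan_policy { ALL_EQUAL, UNEQUAL };

// Dynamic option: NaN handling is chosen at run time through a member flag.
struct dynamic_nan_equal {
  nan_policy policy;

  bool operator()(double lhs, double rhs) const
  {
    if (policy == nan_policy::ALL_EQUAL && std::isnan(lhs) && std::isnan(rhs)) { return true; }
    return lhs == rhs;  // IEEE comparison: NaN == NaN is false
  }
};

// Static option: NaN handling is fixed at compile time by a template parameter,
// so the compiler can drop the NaN checks when NansEqual is false.
template <bool NansEqual>
struct static_nan_equal {
  bool operator()(double lhs, double rhs) const
  {
    return (NansEqual && std::isnan(lhs) && std::isnan(rhs)) || lhs == rhs;
  }
};
```

With either form, a Pandas-style caller would pick the "NaNs equal" behavior and a Spark-style caller the "NaNs unequal" behavior, while sharing the rest of the comparison logic.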

…am parameter

codecov · 2021-11-11T01:25:00Z

Codecov Report

Merging #9345 (3313258) into branch-21.12 (ab4bfaa) will decrease coverage by 0.10%.
The diff coverage is n/a.

❗ Current head 3313258 differs from pull request most recent head 5121878. Consider uploading reports for the commit 5121878 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.12    #9345      +/-   ##
================================================
- Coverage         10.79%   10.68%   -0.11%     
================================================
  Files               116      117       +1     
  Lines             18869    19872    +1003     
================================================
+ Hits               2036     2123      +87     
- Misses            16833    17749     +916

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/dask_cudf/dask_cudf/sorting.py | `92.90% <0.00%> (-1.21%)` | ⬇️ |
| python/cudf/cudf/io/csv.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/hdf.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/orc.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/__init__.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_version.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/abc.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/api/types.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/dlpack.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/frame.py | `0.00% <0.00%> (ø)` | |
| ... and 67 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d60e2e6...5121878.

ttnghia · 2021-11-11T03:22:12Z

@gpucibot merge

…mn (#9553)

This PR adds JNI work for the new interface of `lists::drop_list_duplicates` that operates on keys-values input columns. It also makes a small fix to remove an unused variable in `drop_list_duplicates.cu`.

Blocked by #9345.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert (Bobby) Evans (https://github.com/revans2)
- David Wendt (https://github.com/davidwendt)

URL: #9553

Rewrite API interface and doxygen

1831963

ttnghia added the feature request (New feature or request), 3 - Ready for Review (Ready for review by team), libcudf (Affects libcudf (C++/CUDA) code.), 4 - Needs Review (Waiting for reviewer to review or respond), Spark (Functionality that helps Spark RAPIDS), and breaking (Breaking change) labels Sep 29, 2021

ttnghia self-assigned this Sep 29, 2021

ttnghia requested a review from a team as a code owner September 29, 2021 22:37

ttnghia requested review from trxcllnt and codereport September 29, 2021 22:37

Update doxygen

eea4a6f

ttnghia requested a review from jrhemstad September 29, 2021 22:37

ttnghia changed the title ~~[WIP] Reimplement lists::drop_list_duplicates for keys-values lists columns~~ [WIP] Reimplement lists::drop_list_duplicates for keys-values lists columns [skip ci] Sep 29, 2021

ttnghia commented Sep 29, 2021

cpp/include/cudf/lists/drop_list_duplicates.hpp (outdated)

jrhemstad reviewed Sep 29, 2021

cpp/include/cudf/lists/drop_list_duplicates.hpp (outdated)

Reuse existing duplicate_keep_option enum

18dbf0e

jrhemstad reviewed Sep 30, 2021

cpp/include/cudf/stream_compaction.hpp (outdated)

This was referenced Oct 12, 2021

Fix stream compaction's drop_duplicates API to use stable sort #9417

Merged

Implement lists::stable_sort_lists for stable sorting of elements within each row of lists column #9425

Merged

WIP

c81cdb2

ttnghia added 6 commits October 19, 2021 09:51

Merge branch 'branch-21.12' into drop_list_duplicates_keys_values

f4e111d

# Conflicts: # cpp/include/cudf/stream_compaction.hpp # cpp/src/groupby/sort/aggregate.cpp

Fix errors

2ecf118

Implementation compiles

c5a1d4b

Rewrite doxygen

8ec9518

Fix error

e830a53

Update doxygen

d720adf