Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement lists::drop_list_duplicates for keys-values lists columns #9345

Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Sep 29, 2021

This PR changes the interface of lists::drop_list_duplicates such that it may accept a second (optional) input values lists column, and returns a pairs of lists columns containing the results of copying the input column without duplicate entries.

If the optional values column is given, the users are responsible to have the keys-values columns having the same number of entries in each row. Otherwise, the results will be undefined.

When copying the key entries, the corresponding value entries are also copied at the same time. A parameter duplicate_keep_option reused from stream compaction is used to specify which duplicate keys will be copying.

This closes #9124, and blocked by #9425.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond Spark Functionality that helps Spark RAPIDS breaking Breaking change labels Sep 29, 2021
@ttnghia ttnghia self-assigned this Sep 29, 2021
@ttnghia ttnghia requested a review from a team as a code owner September 29, 2021 22:37
@ttnghia ttnghia changed the title [WIP] Reimplement lists::drop_list_duplicates for keys-values lists columns [WIP] Reimplement lists::drop_list_duplicates for keys-values lists columns [skip ci] Sep 29, 2021
rapids-bot bot pushed a commit that referenced this pull request Oct 15, 2021
Currently, stream compaction API `drop_duplicates` uses unstable sort for all of its internal sorting. This is wrong since we may want to have stable sorting results so we can choose to keep the first or the last duplicate element from the repeated sequences. 

This PR does two things:
 * Fixes the issue mentioned above by using stable sort if the input option is to keep the first or last duplicate element, and
 * Adds a new keep option into the enum class `duplicate_keep_option`:  `KEEP_ANY`. This option allows the user to choose to keep one element from the repeated sequence at any position.

Note that the issue did not show up by any failed test because thrust default (unstable) sort, which is called internally in `drop_duplicate`, still produces the same results as thrust stable sort most of the time (but this is not guaranteed). As such, the current `drop_duplicate` still produces correct results in its tests.

This PR blocks #9345.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)

URL: #9417
rapids-bot bot pushed a commit that referenced this pull request Oct 18, 2021
…ithin each row of lists column (#9425)

This PR adds `lists::stable_sort_lists` that can sort elements within rows of lists column using stable sort. This is necessary for implementing `lists::drop_list_duplicates` that operates on keys-values columns input when we want to remove the values corresponding to duplicate keys with `KEEP_FIRST` or `KEEP_LAST` option.

In order to implement `lists::stable_sort_lists`, stable sort versions for the `segmented_sorted_order` and `segmented_sort_by_key` have also been implemented, which can maintain the order of equally-compared elements within segments. 

This PR blocks #9345.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - MithunR (https://github.com/mythrocks)

URL: #9425
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
cpp/include/cudf/lists/drop_list_duplicates.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/drop_list_duplicates.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/drop_list_duplicates.hpp Outdated Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved
}
};

/**
* @brief Struct used in type_dispatcher for comparing two entries in a lists column.
*/
struct column_row_comparator_dispatch {
offset_type const* const list_offsets;
size_type const* const list_indices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be redundant with element_equality_comparator. Likewise, the table_row_comparator_fn looks to be redundant with row_equality_comparator.

Copy link
Contributor Author

@ttnghia ttnghia Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column_row_comparator_dispatch is not the same as element_equality_comparator. In particular, it allows to have different output depending on the nans_equal parameter. element_equality_comparator does not offer such feature. In addition,
Sorry that I didn't clarify it in @brief doxygen (now added).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a relatively minor thing to duplicate such a large bit of complicated code. Remind me why nans_equal is necessary? For the most part, we've resigned to just always treat NaNs as equal in most libcudf functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a discussion before regarding that with Keith and decided to have an option allowing 2 different behaviors because:

  1. Pandas wants to have NaNs treated as equal
  2. Spark wants to have NaNs considered unequal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark wants to have NaNs considered unequal

Seriously? Most of the code in libcudf that's specialized to treat NaNs as equal was because of Spark. How wonderfully inconsistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, Spark is "consistently inconsistent" thus in groupby it treats NaNs as equal for NaN keys (but not values). That's why we didn't need to modify groupby code for floating-point keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future work I'd like to see adding a static/dynamic option to the row/element comparator for controlling NaN equality so we don't have to maintain this code that is 99% redundant.

@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@rapidsai rapidsai deleted a comment from codecov bot Nov 10, 2021
@codecov
Copy link

codecov bot commented Nov 11, 2021

Codecov Report

Merging #9345 (3313258) into branch-21.12 (ab4bfaa) will decrease coverage by 0.10%.
The diff coverage is n/a.

❗ Current head 3313258 differs from pull request most recent head 5121878. Consider uploading reports for the commit 5121878 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9345      +/-   ##
================================================
- Coverage         10.79%   10.68%   -0.11%     
================================================
  Files               116      117       +1     
  Lines             18869    19872    +1003     
================================================
+ Hits               2036     2123      +87     
- Misses            16833    17749     +916     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 67 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d60e2e6...5121878. Read the comment docs.

@ttnghia
Copy link
Contributor Author

ttnghia commented Nov 11, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 544643c into rapidsai:branch-21.12 Nov 11, 2021
rapids-bot bot pushed a commit that referenced this pull request Nov 12, 2021
…mn (#9553)

This PR adds JNI work for the new interface of `lists::drop_list_duplicates` that operates on keys-values input columns. It also does a small fix to remove an unused variable in `drop_list_duplicates.cu`.

Blocked by #9345.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - David Wendt (https://github.com/davidwendt)

URL: #9553
@ttnghia ttnghia deleted the drop_list_duplicates_keys_values branch November 15, 2021 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] map key deduplication with last one wins
5 participants