Reimplement lists::drop_list_duplicates for keys-values lists columns #9345

Merged
Changes from 1 commit
39 commits
1831963
Rewrite API interface and doxygen
ttnghia Sep 29, 2021
eea4a6f
Update doxygen
ttnghia Sep 29, 2021
18dbf0e
Reuse existing `duplicate_keep_option` enum
ttnghia Sep 30, 2021
c81cdb2
WIP
ttnghia Oct 18, 2021
f4e111d
Merge branch 'branch-21.12' into drop_list_duplicates_keys_values
ttnghia Oct 19, 2021
2ecf118
Fix errors
ttnghia Oct 19, 2021
c5a1d4b
Implementation compiles
ttnghia Oct 19, 2021
8ec9518
Rewrite doxygen
ttnghia Oct 19, 2021
e830a53
Fix error
ttnghia Oct 19, 2021
d720adf
Update doxygen
ttnghia Oct 20, 2021
e058053
Fix all errors, tests passed
ttnghia Oct 20, 2021
a898560
Cleanup
ttnghia Oct 21, 2021
1177da3
Separate code into a header file
ttnghia Oct 21, 2021
6478db8
Implement duplicate_keep_option
ttnghia Oct 21, 2021
8e57842
Reorder parameters
ttnghia Oct 21, 2021
a0b5684
Fix all bugs and added unit tests
ttnghia Oct 21, 2021
db15ef8
Cleanup
ttnghia Oct 21, 2021
be5e2d4
Add comments
ttnghia Oct 22, 2021
063a2a9
Fix style
ttnghia Oct 22, 2021
c074297
Merge branch 'branch-21.12' into drop_list_duplicates_keys_values
ttnghia Oct 28, 2021
58ad58b
Rewrite doxygen
ttnghia Oct 28, 2021
caca70b
Cleanup
ttnghia Oct 28, 2021
a9dbb77
Merge branch 'branch-21.12' into drop_list_duplicates_keys_values
ttnghia Nov 2, 2021
2272094
Merge branch 'branch-21.12' into drop_list_duplicates_keys_values
ttnghia Nov 8, 2021
1e47de2
Remove staled header
ttnghia Nov 9, 2021
93d4eef
Rewrite doxygen
ttnghia Nov 9, 2021
58fff1c
Rewrite `drop_list_duplicates.cu`
ttnghia Nov 9, 2021
bae68f1
Merge branch 'branch-21.12' into drop_list_duplicates_keys_values
ttnghia Nov 9, 2021
fa866e7
Rewrite doxygen
ttnghia Nov 9, 2021
ba93300
Merge branch 'branch-21.12' into drop_list_duplicates_keys_values
ttnghia Nov 10, 2021
dadf273
Fix doxygen
ttnghia Nov 10, 2021
bdf9912
Rewrite doxygen
ttnghia Nov 10, 2021
06048ff
Rewrite doxygen
ttnghia Nov 10, 2021
2446711
Add detail interface for `normalize_nans_and_zeros` that accepts stre…
ttnghia Nov 10, 2021
7a65185
Address review comments
ttnghia Nov 10, 2021
07056ce
Fix comment typos
ttnghia Nov 10, 2021
c35d6cb
Construct `gather_map` as `device_span` instead of `column_view`
ttnghia Nov 10, 2021
6a96f74
Fix `device_span` ctor input
ttnghia Nov 10, 2021
5121878
Remove `has_value()` check and use the optional object as bool
ttnghia Nov 10, 2021
46 changes: 22 additions & 24 deletions cpp/include/cudf/lists/drop_list_duplicates.hpp
@@ -29,33 +29,27 @@ namespace lists {
* @file
*/

/*
* @brief Flag to specify which entry to keep when removing the duplicate entries from a repeated
* sequence.
*/
enum class keep_policy {
UNDEFINED, ///< An arbitrary entry at an unknown position in the repeated sequence will be kept.
FIRST, ///< Keep the first entry (all duplicate entries after it will be removed).
LAST ///< Keep the last entry (all duplicate entries before it will be removed).
};

/**
* @brief Create new lists columns by extracting the key list entries and their corresponding value
* entries from the given lists columns such that only the unique list entries in the `keys` column
* will be copied.
*
* In some cases, there is only a need to remove duplicates entries from one input lists column. In
* such situations, the input values lists column can be ignored.
*
* If the `values` lists column is given, the users are responsible to have the keys-values columns
* having the same number of entries in each row. Otherwise, the results will be undefined.
*
* Given a pair of keys-values lists columns, each list entry in the keys column corresponds to a
* list entry in the values column (i.e., the lists at each row index in both keys and values
* columns have the same size). The entries in both columns are copied into an output pair of keys
* and values lists columns (respectively), in a way such that the repeated key entries (and their
* and values lists columns (respectively), in a way such that the duplicate key entries (and their
* corresponding value entries) are dropped out to keep only the entries with unique keys.
*
* In some cases, there is only a need to remove duplicate entries from one input lists column. In
* such situations, the input values lists column can be omitted. If the `values` lists column is
* given, the caller is responsible for ensuring that the keys and values columns have the same
* number of entries in each row. Otherwise, the results are undefined.
*
* When generating unique entries for the output, depending on the value of @p keep_option:
* - KEEP_FIRST: only the first of a sequence of duplicate entries is copied
* - KEEP_LAST: only the last of a sequence of duplicate entries is copied
* - KEEP_ANY_ONE: one entry at an undefined position in the sequence of duplicate entries is copied
*
* The order of entries within each list of the output lists columns is not guaranteed to be
* preserved as in the input. In the current implementation, entries in the output keys lists are
* sorted in ascending order (nulls last), but this is not guaranteed in future implementations.
@@ -69,25 +63,29 @@ enum class keep_policy {
* @param nulls_equal Flag to specify whether null key entries should be considered equal.
* @param nans_equal Flag to specify whether NaN key entries should be considered as equal value
* (only applicable for floating point data column).
* @param keep_entry Flag to specify which entry will be kept when removing duplicate entries in the
* repeated sequence. This is only relevant when the values lists column is given.
* @param keep_option Flag to specify which entry will be kept when copying unique entries from
* the duplicate entries. This is only relevant when the values lists column is given.
* @param mr Device resource used to allocate memory.
*
* @code{.pseudo}
* input = { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
* output = { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
* keys = { {1, 1, 2, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
* values = { {"a", "b", "c", "d"}, {"e"}, NULL, {}, {"NA", "NA", "NA", "f", "g", "h", "i", "j"} }
*
* [out_keys, out_values] = drop_list_duplicates(keys, values)
* out_keys = { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
* out_values = { {"a", "c", "d"}, {"e"}, NULL, {}, {"f", "g", "NA"} }
* @endcode
*
* @return A pair of pointers storing to the columns resulted from removing duplicate key entries
* @return A pair of pointers to the columns resulting from copying unique key entries
* and their corresponding values entries from the input lists columns. If the input values
* column is missing, its corresponding output will be a null pointer.
* column is missing, its corresponding output pointer will be null.
*/
std::pair<std::unique_ptr<column>, std::unique_ptr<column>> drop_list_duplicates(
lists_column_view const& keys,
std::optional<lists_column_view> const& values,
null_equality nulls_equal = null_equality::EQUAL,
nan_equality nans_equal = nan_equality::UNEQUAL,
keep_policy keep_entry = keep_policy::UNDEFINED,
duplicate_keep_option keep_option = duplicate_keep_option::KEEP_ANY_ONE,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group
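
To make the keys-values semantics described in the doxygen above concrete, here is a minimal host-side sketch for a single list row, written against the standard library only. It is not the cudf implementation: `keep_option`, `key_value_list`, and `drop_duplicates_in_row` are hypothetical names, nulls and NaNs are not modeled, and `KEEP_ANY_ONE` is resolved to the first occurrence simply because any position is acceptable for that option.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for cudf::duplicate_keep_option, limited to the
// three options the doxygen lists for drop_list_duplicates.
enum class keep_option { KEEP_FIRST, KEEP_LAST, KEEP_ANY_ONE };

// One output row: the surviving keys and their corresponding values.
using key_value_list = std::pair<std::vector<int>, std::vector<std::string>>;

// Reference behavior for ONE list row (assumption: no nulls, integer keys).
key_value_list drop_duplicates_in_row(std::vector<int> const& keys,
                                      std::vector<std::string> const& values,
                                      keep_option keep)
{
  // The caller must supply the same number of keys and values per row.
  assert(keys.size() == values.size());

  // Map each distinct key to the index of the entry that survives.
  std::map<int, std::size_t> kept;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto [it, inserted] = kept.try_emplace(keys[i], i);
    // KEEP_LAST overwrites with the latest index; KEEP_FIRST and
    // KEEP_ANY_ONE (in this sketch) keep the first index seen.
    if (!inserted && keep == keep_option::KEEP_LAST) { it->second = i; }
  }

  // std::map iterates in ascending key order, which happens to mirror the
  // sorted output the doxygen mentions but does not guarantee.
  key_value_list out;
  for (auto const& [key, idx] : kept) {
    out.first.push_back(key);
    out.second.push_back(values[idx]);
  }
  return out;
}
```

For the first row of the doxygen example (keys `{1, 1, 2, 3}`, values `{"a", "b", "c", "d"}`), this sketch produces keys `{1, 2, 3}` with values `{"a", "c", "d"}` under `KEEP_FIRST`, and values `{"b", "c", "d"}` under `KEEP_LAST`.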
8 changes: 5 additions & 3 deletions cpp/include/cudf/stream_compaction.hpp
@@ -208,9 +208,11 @@ std::unique_ptr<table> apply_boolean_mask(
* @brief Choices for drop_duplicates API for retainment of duplicate rows
*/
enum class duplicate_keep_option {
KEEP_FIRST = 0, ///< Keeps first duplicate row and unique rows
KEEP_LAST, ///< Keeps last duplicate row and unique rows
KEEP_NONE ///< Keeps only unique rows are kept
KEEP_FIRST = 0, ///< Keeps first duplicate element and unique elements
KEEP_LAST, ///< Keeps last duplicate element and unique elements
KEEP_ANY_ONE, ///< Keeps one duplicate element at an undefined position and unique elements (this
///< option may not be supported in certain operations)
KEEP_NONE ///< Keeps only unique elements
};

/**
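
The difference between the four enum values is easiest to see on a small sorted sequence. The following is a rough sketch, assuming equal elements are already adjacent; `keep_option` and `kept_indices` are hypothetical names used only for illustration and do not reflect how cudf implements these options on the device.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical mirror of the updated duplicate_keep_option enum.
enum class keep_option { KEEP_FIRST, KEEP_LAST, KEEP_ANY_ONE, KEEP_NONE };

// Returns the indices of the elements that survive, given an input in which
// equal elements are adjacent (for example, a sorted vector).
std::vector<std::size_t> kept_indices(std::vector<int> const& sorted, keep_option keep)
{
  std::vector<std::size_t> out;
  std::size_t begin = 0;
  while (begin < sorted.size()) {
    // Find the end of the run of elements equal to sorted[begin].
    std::size_t end = begin + 1;
    while (end < sorted.size() && sorted[end] == sorted[begin]) { ++end; }
    bool const is_unique = (end - begin == 1);

    switch (keep) {
      case keep_option::KEEP_FIRST:
      case keep_option::KEEP_ANY_ONE:  // any position is valid; this sketch picks the first
        out.push_back(begin);
        break;
      case keep_option::KEEP_LAST: out.push_back(end - 1); break;
      case keep_option::KEEP_NONE:
        // Elements that have duplicates are dropped entirely.
        if (is_unique) { out.push_back(begin); }
        break;
    }
    begin = end;
  }
  return out;
}
```

For the input `{1, 1, 2, 3, 3, 3}`, this returns indices `{0, 2, 3}` for `KEEP_FIRST`, `{1, 2, 5}` for `KEEP_LAST`, and `{2}` for `KEEP_NONE`. A real `KEEP_ANY_ONE` implementation is free to return any index within each run.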
4 changes: 4 additions & 0 deletions cpp/src/stream_compaction/drop_duplicates.cu
@@ -191,6 +191,10 @@ std::unique_ptr<table> drop_duplicates(table_view const& input,
return empty_like(input);
}

CUDF_EXPECTS(
keep != duplicate_keep_option::KEEP_ANY_ONE,
"The option `duplicate_keep_option::KEEP_ANY_ONE` is not yet supported in `drop_duplicates`");

auto keys_view = input.select(keys);

// The values will be filled into this column
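
From a caller's point of view, the new guard means that passing `KEEP_ANY_ONE` to `drop_duplicates` fails up front instead of silently behaving like another option. A stripped-down sketch of that pattern, using `std::logic_error` as a stand-in for the exception raised by `CUDF_EXPECTS` (the names `keep_option` and `validate_keep_option` are hypothetical):

```cpp
#include <stdexcept>

// Hypothetical mirror of duplicate_keep_option.
enum class keep_option { KEEP_FIRST, KEEP_LAST, KEEP_ANY_ONE, KEEP_NONE };

// Rejects the option that the table-level operation does not support yet.
// std::logic_error is an assumption standing in for cudf's own exception type.
void validate_keep_option(keep_option keep)
{
  if (keep == keep_option::KEEP_ANY_ONE) {
    throw std::logic_error("KEEP_ANY_ONE is not yet supported in drop_duplicates");
  }
}
```

Callers that want "any one" semantics for table rows would, under this guard, need to pick `KEEP_FIRST` or `KEEP_LAST` explicitly.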