[REVIEW] Add cudf::encode
#5572
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Hang on, isn't this what a dictionary column is for? Isn't this just dictionary encoding with sequential integer keys? CC @davidwendt
Yes - this PR is just exposing functionality already written for dictionary columns. I guess we could just cast the input column to a dictionary column and then grab the indices from there, but it feels like an implementation detail that the indices will always be 0 to N-1.
Co-authored-by: Mark Harris <[email protected]>
Should have xref'd issue: #5498
cpp/tests/encode/encode_tests.cpp
Outdated
#include <tests/utilities/base_fixture.hpp>
#include <tests/utilities/column_utilities.hpp>
#include <tests/utilities/column_wrapper.hpp>
#include <tests/utilities/table_utilities.hpp>
Is tests/utilities/table_utilities.hpp needed?
Removed!
Co-authored-by: David <[email protected]>
Codecov Report
@@            Coverage Diff             @@
##           branch-0.15    #5572   +/-   ##
============================================
  Coverage        86.38%   86.38%
============================================
  Files               76       76
  Lines            13033    13036    +3
============================================
+ Hits             11258    11261    +3
  Misses            1775     1775
Continue to review full report at Codecov.
Looks good.
 * The result column is such that keys[result[i]] == input[i],
 * where `keys` is the set of distinct values in `input` in sorted order.
 *
 * Examples:
Surround examples with @code{.pseudo} and @endcode doxygen tags.
https://github.com/rapidsai/cudf/blob/branch-0.15/cpp/docs/DOCUMENTATION.md#inline-examples
Nice! Done
cpp/include/cudf/transform.hpp
Outdated
 * output: [{1, 2, 3, 9}, {0, 2, 0, 1, 3}]
 *
 * @param input Column containing values to be encode
 * @param mr Device memory resource used to allocate the returned bitmask.
- * @param mr Device memory resource used to allocate the returned bitmask.
+ * @param mr Device memory resource used to allocate the returned column's device memory
fixed
cpp/include/cudf/transform.hpp
Outdated
 * The encoded values are integers in the range [0, n), where `n`
 * is the number of distinct values in the input column.
 * The result column is such that keys[result[i]] == input[i],
 * where `keys` is the set of distinct values in `input` in sorted order.
- * where `keys` is the set of distinct values in `input` in sorted order.
+ * where `keys` is the set of distinct values in `input` in sorted ascending order.
fixed
/**
 * @brief Encode the values of the given column as integers
 *
 * The encoded values are integers in the range [0, n), where `n`
Describe the behavior for nulls in input.
Just added a description - let me know if it's clear enough
Ah but it's not really accurate.
Fixed
A small doc change, other than that LGTM
@@ -52,5 +52,11 @@ std::pair<std::unique_ptr<rmm::device_buffer>, cudf::size_type> bools_to_mask(
  column_view const& input,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
  cudaStream_t stream = 0);
doc?
Nice catch -- fixed
Essentially moves functionality out of encode.cu into a public API and adds tests for it. See #5498.
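
For readers skimming the thread, the contract quoted from the doc comment (`keys[result[i]] == input[i]`, with `keys` being the distinct input values in sorted ascending order) can be sketched on the host in plain C++. This is only an illustration of the semantics, not the libcudf implementation: `encode_sketch` is a hypothetical helper, the `int` element type and `std::int32_t` index type are assumptions, and nulls are not modelled because their handling is not spelled out in this conversation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side sketch (NOT the cudf API): returns {keys, indices}
// such that keys[indices[i]] == input[i], with keys sorted ascending.
// Nulls are not modelled here.
std::pair<std::vector<int>, std::vector<std::int32_t>> encode_sketch(
  std::vector<int> const& input)
{
  // keys: distinct input values in sorted ascending order
  std::vector<int> keys(input);
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

  // indices: position of each input value within keys, so every index
  // falls in [0, n) where n is the number of distinct values
  std::vector<std::int32_t> indices;
  indices.reserve(input.size());
  for (int value : input) {
    auto it = std::lower_bound(keys.begin(), keys.end(), value);
    indices.push_back(static_cast<std::int32_t>(it - keys.begin()));
  }
  return {std::move(keys), std::move(indices)};
}
```

Calling `encode_sketch({1, 3, 1, 2, 9})` gives keys `{1, 2, 3, 9}` and indices `{0, 2, 0, 1, 3}`, which is the pair shown in the documented example output; that input is inferred here from the `keys[result[i]] == input[i]` relation rather than quoted from the PR.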