Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add cudf::encode #5572

Merged
merged 21 commits into from
Jul 22, 2020
Merged

[REVIEW] Add cudf::encode #5572

merged 21 commits into from
Jul 22, 2020

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Jun 24, 2020

Essentially moves functionality out of encode.cu into a public API and adds tests for it. See #5498.

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

cpp/src/transform/codify.cpp Outdated Show resolved Hide resolved
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on, isn't this what a dictionary column is for? Isn't this just dictionary encoding with sequential integer keys? CC @davidwendt

cpp/include/cudf/transform.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/transform.hpp Outdated Show resolved Hide resolved
@shwina
Copy link
Contributor Author

shwina commented Jun 25, 2020

Yes - this PR is just exposing functionality already written for dictionary columns. I guess we could just cast the input column to a dictionary column and then grab the indices from there, but it feels like an implementation detail that the indices will always be 0 to N-1.

@shwina
Copy link
Contributor Author

shwina commented Jun 25, 2020

Should have xref'd issue: #5498

@shwina shwina marked this pull request as ready for review July 21, 2020 19:58
@shwina shwina requested review from a team as code owners July 21, 2020 19:58
@shwina shwina requested review from harrism and rgsl888prabhu July 21, 2020 19:58
@shwina shwina changed the title [WIP] Add cudf::codify [REVIEW] Add cudf::codify Jul 21, 2020
@shwina shwina changed the title [REVIEW] Add cudf::codify [REVIEW] Add cudf::encode Jul 21, 2020
cpp/src/transform/encode.cpp Outdated Show resolved Hide resolved
cpp/src/transform/encode.cpp Outdated Show resolved Hide resolved
cpp/tests/encode/encode_tests.cpp Outdated Show resolved Hide resolved
#include <tests/utilities/base_fixture.hpp>
#include <tests/utilities/column_utilities.hpp>
#include <tests/utilities/column_wrapper.hpp>
#include <tests/utilities/table_utilities.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is tests/utilities/table_utilities.hpp needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

cpp/tests/CMakeLists.txt Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 21, 2020

Codecov Report

Merging #5572 into branch-0.15 will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.15    #5572   +/-   ##
============================================
  Coverage        86.38%   86.38%           
============================================
  Files               76       76           
  Lines            13033    13036    +3     
============================================
+ Hits             11258    11261    +3     
  Misses            1775     1775           
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 92.24% <0.00%> (-0.03%) ⬇️
python/cudf/cudf/core/dataframe.py 89.54% <0.00%> (+<0.01%) ⬆️
python/cudf/cudf/core/frame.py 90.19% <0.00%> (+0.01%) ⬆️
python/cudf/cudf/core/column/column.py 88.69% <0.00%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79bd435...edcef21. Read the comment docs.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

* The result column is such that keys[result[i]] == input[i],
* where `keys` is the set of distinct values in `input` in sorted order.
*
* Examples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surround examples with @code{.pseudo} and @endcode doxygen tags.
https://github.com/rapidsai/cudf/blob/branch-0.15/cpp/docs/DOCUMENTATION.md#inline-examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Done

* output: [{1, 2, 3, 9}, {0, 2, 0, 1, 3}]
*
* @param input Column containing values to be encode
* @param mr Device memory resource used to allocate the returned bitmask.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param mr Device memory resource used to allocate the returned bitmask.
* @param mr Device memory resource used to allocate the returned column's device memory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* The encoded values are integers in the range [0, n), where `n`
* is the number of distinct values in the input column.
* The result column is such that keys[result[i]] == input[i],
* where `keys` is the set of distinct values in `input` in sorted order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* where `keys` is the set of distinct values in `input` in sorted order.
* where `keys` is the set of distinct values in `input` in sorted ascending order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/**
* @brief Encode the values of the given column as integers
*
* The encoded values are integers in the range [0, n), where `n`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe the behavior for nulls in input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a description - let me know if it's clear enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but it's not really accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small doc change, other than that LGTM

@@ -52,5 +52,11 @@ std::pair<std::unique_ptr<rmm::device_buffer>, cudf::size_type> bools_to_mask(
column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
cudaStream_t stream = 0);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch -- fixed

@jrhemstad jrhemstad merged commit 66c5810 into rapidsai:branch-0.15 Jul 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants