[REVIEW] Define and implement new stream compaction APIs copy_if, drop_nulls, apply_boolean_mask, drop_duplicate and unique_count. #3303

Merged
merged 49 commits into from
Nov 21, 2019
Changes from 13 commits
Commits
49 commits
86b05ec
changes
rgsl888prabhu Oct 31, 2019
470d7a2
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into 2…
rgsl888prabhu Oct 31, 2019
0c6ad10
primary changes
rgsl888prabhu Nov 1, 2019
8331791
files are compiling
rgsl888prabhu Nov 4, 2019
f46ae31
changes for test cases
rgsl888prabhu Nov 4, 2019
d055987
merge 0.11
rgsl888prabhu Nov 4, 2019
216b367
changes to filter
rgsl888prabhu Nov 5, 2019
26b1b32
changes to return column
rgsl888prabhu Nov 5, 2019
8e755fb
ahh working
rgsl888prabhu Nov 5, 2019
b743175
All set
rgsl888prabhu Nov 5, 2019
7cfccc1
documentation
rgsl888prabhu Nov 5, 2019
a0d0e61
code changes and test cases
rgsl888prabhu Nov 5, 2019
a72ebe7
CHANGELOG
rgsl888prabhu Nov 5, 2019
ca97b2e
adding apply_boolean_mask
rgsl888prabhu Nov 7, 2019
8a1cea6
Adding apply_boolean_mask with test case
rgsl888prabhu Nov 7, 2019
8dd75d8
documentation
rgsl888prabhu Nov 7, 2019
664afea
review changes
rgsl888prabhu Nov 7, 2019
8c4f556
changes
rgsl888prabhu Nov 7, 2019
eb457ed
Merge branch 'branch-0.11' into 2948_adding_copy_if
harrism Nov 8, 2019
f15905a
unique_count and test cases
rgsl888prabhu Nov 11, 2019
4f710c7
added string test for unique count
rgsl888prabhu Nov 11, 2019
07d48c5
Added drop_duplicate test cases
rgsl888prabhu Nov 11, 2019
f75ad3a
documentation
rgsl888prabhu Nov 11, 2019
fa74c22
merge
rgsl888prabhu Nov 11, 2019
2ae0095
documentation
rgsl888prabhu Nov 11, 2019
3801ddd
cosmetic changes
rgsl888prabhu Nov 12, 2019
a544d09
doc changes
rgsl888prabhu Nov 12, 2019
00a6c6f
review changes including wrapper
rgsl888prabhu Nov 12, 2019
b20edc9
drop_duplicate to use vector of index values as keys
rgsl888prabhu Nov 13, 2019
0ffbf41
Changes apart from scatter in copy_if
rgsl888prabhu Nov 13, 2019
257b07d
documentation
rgsl888prabhu Nov 13, 2019
e076c09
addressed final set of review comments
rgsl888prabhu Nov 14, 2019
4551a99
review changes
rgsl888prabhu Nov 15, 2019
1fad669
code changes to support string in copy_if
rgsl888prabhu Nov 15, 2019
eb0641f
merge with 0.11
rgsl888prabhu Nov 15, 2019
3bb2666
missed changes
rgsl888prabhu Nov 15, 2019
c5d2f9f
merge with 0.11
rgsl888prabhu Nov 18, 2019
e9d3298
review changes
rgsl888prabhu Nov 18, 2019
653799f
specialization for gather
rgsl888prabhu Nov 18, 2019
703221a
removing the factory method
rgsl888prabhu Nov 18, 2019
856794d
review changes
rgsl888prabhu Nov 18, 2019
df0d9aa
review changes
rgsl888prabhu Nov 19, 2019
8a77cfb
review changes
rgsl888prabhu Nov 19, 2019
a1e1259
Merge branch 'branch-0.11' into 2948_adding_copy_if
harrism Nov 20, 2019
f5687b2
string test for drop_duplicates
rgsl888prabhu Nov 20, 2019
4129ed5
adding string test for cudf::gather
rgsl888prabhu Nov 21, 2019
91ef036
review changes
rgsl888prabhu Nov 21, 2019
ae85f57
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into 2…
rgsl888prabhu Nov 21, 2019
5e64ad3
review changes
rgsl888prabhu Nov 21, 2019
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -79,6 +79,7 @@
- PR #3239 Adding floating point specialization to comparators for NaNs
- PR #3270 Move predicates files to legacy
- PR #3282 Add `num_bitmask_words`
+- PR #3303 Define and implement new stream_compaction APIs `copy_if` and `drop_nulls`


## Bug Fixes
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -392,6 +392,7 @@ add_library(cudf
src/transform/jit/code/kernel.cpp
src/transform/legacy/nans_to_nulls.cu
src/bitmask/legacy/bitmask_ops.cu
+src/stream_compaction/drop_nulls.cu
src/stream_compaction/legacy/apply_boolean_mask.cu
src/stream_compaction/legacy/drop_nulls.cu
src/stream_compaction/legacy/drop_duplicates.cu
424 changes: 424 additions & 0 deletions cpp/include/cudf/detail/copy_if.cuh

Large diffs are not rendered by default.

71 changes: 71 additions & 0 deletions cpp/include/cudf/detail/stream_compaction.hpp
@@ -0,0 +1,71 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/cudf.h>
#include <cudf/types.hpp>

namespace cudf {
namespace experimental {
namespace detail {

/**
* @brief Filters a table to remove null elements.
*
* Filters the rows of the `input` considering specified columns from
* `keys` for validity / null values.
*
* Given an input table_view, row `i` from the input columns is copied to
* the output if the same row `i` of @p keys has at least @p keep_threshold
* non-null fields.
*
* This operation is stable: the input order is preserved in the output.
*
* Any non-nullable column in the input is treated as all non-null.
*
* @example input {col1: {1, 2, 3, null},
* col2: {4, 5, null, null},
* col3: {7, null, null, null}}
* keys = input
* keep_threshold = 2
*
* output {col1: {1, 2}
* col2: {4, 5}
* col3: {7, null}}
*
* @note if @p keys is empty (no columns or no rows) or has no nulls,
* there is no error, and a copy of @p input is returned
*
* @throws cudf::logic_error if @p keys is non-empty and keys.num_rows() is less
* than input.num_rows()
*
* @param[in] input The input `table_view` to filter.
* @param[in] keys The `table_view` to filter `input`.
* @param[in] keep_threshold The minimum number of non-null fields in a row
* required to keep the row.
* @param[in] mr Optional, The resource to use for all allocations
* @param[in] stream Optional CUDA stream on which to execute kernels
* @return unique_ptr<table> Table containing all rows of the `input` with at least @p keep_threshold non-null fields in @p keys.
*/
std::unique_ptr<experimental::table> drop_nulls(table_view const& input,
table_view const& keys,
cudf::size_type keep_threshold,
rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource(),
cudaStream_t stream = 0);
} // namespace detail
} // namespace experimental
} // namespace cudf
94 changes: 94 additions & 0 deletions cpp/include/cudf/stream_compaction.hpp
@@ -0,0 +1,94 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/cudf.h>
#include <cudf/types.hpp>

namespace cudf {
namespace experimental {

/**
* @brief Filters a table to remove null elements.
*
* Filters the rows of the `input` considering specified columns from
* `keys` for validity / null values.
*
* Given an input table_view, row `i` from the input columns is copied to
* the output if the same row `i` of @p keys has at least @p keep_threshold
* non-null fields.
*
* This operation is stable: the input order is preserved in the output.
*
* Any non-nullable column in the input is treated as all non-null.
*
* @example input {col1: {1, 2, 3, null},
* col2: {4, 5, null, null},
* col3: {7, null, null, null}}
* keys = input
* keep_threshold = 2
*
* output {col1: {1, 2}
* col2: {4, 5}
* col3: {7, null}}
*
* @note if @p keys is empty (no columns or no rows) or has no nulls,
* there is no error, and a copy of @p input is returned
*
* @throws cudf::logic_error if @p keys is non-empty and keys.num_rows() is less
* than input.num_rows()
*
* @param[in] input The input `table_view` to filter.
* @param[in] keys The `table_view` to filter `input`.
* @param[in] keep_threshold The minimum number of non-null fields in a row
* required to keep the row.
* @param[in] mr Optional, The resource to use for all allocations
* @return unique_ptr<table> Table containing all rows of the `input` with at least @p keep_threshold non-null fields in @p keys.
*/
std::unique_ptr<experimental::table> drop_nulls(table_view const& input,
table_view const& keys,
cudf::size_type keep_threshold,
rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource());

/**
* @brief Filters a table to remove null elements.
*
* @example input {col1: {1, 2, 3, null},
* col2: {4, 5, null, null},
* col3: {7, null, null, null}}
* keys = input
*
* output {col1: {1}
* col2: {4}
* col3: {7}}
*
* @overload drop_nulls
*
* Same as drop_nulls but defaults keep_threshold to the number of columns in
* @p keys.
*
* @param[in] input The input `table_view` to filter.
* @param[in] keys The `table_view` to filter `input`.
* @param[in] mr Optional, The resource to use for all allocations
* @return unique_ptr<table> Table containing all rows of the `input` without nulls in the columns of @p keys.
*/
std::unique_ptr<experimental::table> drop_nulls(table_view const &input,
table_view const &keys,
rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource());

} // namespace experimental
} // namespace cudf
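
The `keep_threshold` rule documented above can be illustrated with a small host-only sketch. This is not the cudf implementation: `column`, `table`, and `drop_nulls_sketch` are hypothetical stand-ins built on `std::optional`, it handles only integer columns, and it assumes `keys` has at least as many rows as `input`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side stand-ins; these are NOT the cudf types.
using column = std::vector<std::optional<int>>;
using table  = std::vector<column>;

// Keeps row i of `input` when row i of `keys` has at least `keep_threshold`
// non-null fields. Rows are visited in order, so the result is stable.
table drop_nulls_sketch(table const& input, table const& keys,
                        std::size_t keep_threshold) {
  // Mirrors the early return in the diff: nothing to filter on.
  if (keys.empty() || input.empty()) return input;

  table output(input.size());
  std::size_t const num_rows = input.front().size();
  for (std::size_t row = 0; row < num_rows; ++row) {
    std::size_t valid = 0;
    for (auto const& key : keys) {
      if (key[row].has_value()) ++valid;
    }
    if (valid >= keep_threshold) {
      for (std::size_t c = 0; c < input.size(); ++c) {
        output[c].push_back(input[c][row]);
      }
    }
  }
  return output;
}

// Overload mirroring the documented default: every key column must be non-null.
table drop_nulls_sketch(table const& input, table const& keys) {
  return drop_nulls_sketch(input, keys, keys.size());
}
```

With the three-column table from the doc comments used as both `input` and `keys`, a threshold of 2 keeps the first two rows, and the two-argument overload keeps only the first row.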
9 changes: 5 additions & 4 deletions cpp/include/cudf/table/table_view.hpp
@@ -149,13 +149,14 @@ class table_view : public detail::table_view_base<column_view> {
class mutable_table_view : public detail::table_view_base<mutable_column_view> {
using detail::table_view_base<mutable_column_view>::table_view_base;

-mutable_column_view& column(size_type column_index) const noexcept {
-return const_cast<mutable_column_view&>(table_view_base::column(column_index));
-}
+public:
+mutable_column_view& column(size_type column_index) const noexcept {
+return const_cast<mutable_column_view&>(table_view_base::column(column_index));
+}
/**---------------------------------------------------------------------------*
* @brief Creates an immutable `table_view` of the columns
*---------------------------------------------------------------------------**/
-operator table_view();
+operator table_view();
};

inline bool has_nulls(table_view view) {
2 changes: 1 addition & 1 deletion cpp/include/cudf/wrappers/bool.hpp
@@ -151,7 +151,7 @@ struct bool8 {
}

private:
-value_type value{0};
+value_type value;
};

// This is necessary for global, constant, non-fundamental types
2 changes: 1 addition & 1 deletion cpp/include/cudf/wrappers/timestamps.hpp
@@ -35,7 +35,7 @@ namespace detail {
template <class Duration>
struct timestamp : time_point<Duration> {
/// Initialize timestamp to 1/1/1970:00:00:00
-constexpr timestamp() : time_point<Duration>(Duration(0)) {};
+constexpr timestamp() : time_point<Duration>(Duration()) {};
constexpr timestamp(Duration d) : time_point<Duration>(d) {};
constexpr timestamp(typename Duration::rep t) : time_point<Duration>(Duration(t)) {};
};
129 changes: 129 additions & 0 deletions cpp/src/stream_compaction/drop_nulls.cu
@@ -0,0 +1,129 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/table/table_device_view.cuh>
#include <cudf/detail/stream_compaction.hpp>
#include <cudf/detail/copy_if.cuh>

namespace {

// Returns true if the mask is true for index i in at least keep_threshold
// columns
struct valid_table_filter
{
__device__ inline
bool operator()(cudf::size_type i)
{
auto valid = [i](auto column_device_view) {
return column_device_view.is_valid(i);
};

auto count =
thrust::count_if(thrust::seq, keys_device_view.begin(), keys_device_view.end(), valid);

return (count >= keep_threshold);
}

static auto create(cudf::table_device_view const& keys,
cudf::size_type keep_threshold,
cudaStream_t stream = 0)
{
//auto keys_device_view = cudf::table_device_view::create(keys, stream);

auto deleter = [stream](valid_table_filter* f) { f->destroy(); };
std::unique_ptr<valid_table_filter, decltype(deleter)> p {
new valid_table_filter(keys, keys.num_columns(), keep_threshold),
deleter
};

return p;
}

__host__ void destroy() {
delete this;
}

valid_table_filter() = delete;
~valid_table_filter() = default;

protected:


valid_table_filter(cudf::table_device_view const& keys_device_view,
cudf::size_type num_columns,
cudf::size_type keep_threshold)
: keep_threshold(keep_threshold),
num_columns(num_columns),
keys_device_view(keys_device_view) {}

cudf::size_type keep_threshold;
cudf::size_type num_columns;
cudf::table_device_view keys_device_view;
};

} // namespace

namespace cudf {
namespace experimental {
namespace detail {

/*
* Filters a table to remove null elements.
*/
std::unique_ptr<experimental::table> drop_nulls(table_view const& input,
table_view const& keys,
cudf::size_type keep_threshold,
rmm::mr::device_memory_resource *mr,
cudaStream_t stream) {
if (keys.num_columns() == 0 || keys.num_rows() == 0 ||
not cudf::has_nulls(keys)) {
return std::make_unique<table>(input, stream, mr);
}

CUDF_EXPECTS(keys.num_rows() <= input.num_rows(),
"Column size mismatch");

auto keys_device_view = cudf::table_device_view::create(keys, stream);
auto filter = valid_table_filter::create(*keys_device_view, keep_threshold);

return cudf::experimental::detail::copy_if(input, *filter.get(), mr, stream);
}

} //namespace detail

/*
* Filters a table to remove null elements.
*/
std::unique_ptr<experimental::table> drop_nulls(table_view const& input,
table_view const& keys,
cudf::size_type keep_threshold,
rmm::mr::device_memory_resource *mr) {
return cudf::experimental::detail::drop_nulls(input, keys, keep_threshold, mr);
}
/*
* Filters a table to remove null elements.
*/
std::unique_ptr<experimental::table> drop_nulls(table_view const &input,
table_view const &keys,
rmm::mr::device_memory_resource *mr)
{
return cudf::experimental::detail::drop_nulls(input, keys, keys.num_columns(), mr);
}

} //namespace experimental
} //namespace cudf
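
The shape of `valid_table_filter` (a predicate over a row index that counts valid key columns, handed to a `copy_if`-style compaction) can be sketched on the host as follows. This is an illustration under stated assumptions, not the code in `copy_if.cuh`, whose diff is not rendered here: `host_column`, `valid_row_filter`, and `kept_rows` are hypothetical names, and the validity mask is assumed to be Arrow-style (bit `i` set means row `i` is non-null, least significant bit first).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical host-side model of a nullable column: values plus a validity
// bitmask. ASSUMPTION: bit i set to 1 means row i is non-null. An empty mask
// models a non-nullable column, which is treated as all non-null.
struct host_column {
  std::vector<int> values;
  std::vector<std::uint32_t> validity;

  bool is_valid(std::size_t i) const {
    if (validity.empty()) return true;
    return ((validity[i / 32] >> (i % 32)) & 1u) != 0;
  }
};

// Predicate on a row index, analogous in shape to valid_table_filter:
// true when at least keep_threshold key columns are valid at row i.
struct valid_row_filter {
  std::vector<host_column> const& keys;
  std::size_t keep_threshold;

  bool operator()(std::size_t i) const {
    auto const count = std::count_if(
        keys.begin(), keys.end(),
        [i](host_column const& c) { return c.is_valid(i); });
    return static_cast<std::size_t>(count) >= keep_threshold;
  }
};

// Stable compaction of row indices: the surviving indices come out in input
// order and could then be used to gather rows from every input column.
std::vector<std::size_t> kept_rows(std::size_t num_rows,
                                   valid_row_filter const& filter) {
  std::vector<std::size_t> all(num_rows);
  std::iota(all.begin(), all.end(), std::size_t{0});
  std::vector<std::size_t> kept;
  std::copy_if(all.begin(), all.end(), std::back_inserter(kept), filter);
  return kept;
}
```

Separating the predicate from the compaction step is what lets one `copy_if` routine serve several filters; only the functor changes between them.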
10 changes: 9 additions & 1 deletion cpp/tests/CMakeLists.txt
@@ -495,7 +495,15 @@ set(LEGACY_MERGE_TEST_SRC
ConfigureTest(LEGACY_MERGE_TEST "${LEGACY_MERGE_TEST_SRC}")

###################################################################################################
-# - legacy stream compaction tests ----------------------------------------------------------------
+# - stream compaction tests -----------------------------------------------------------------------
+
+set(STREAM_COMPACTION_TEST_SRC
+"${CMAKE_CURRENT_SOURCE_DIR}/stream_compaction/drop_nulls_tests.cu")
+
+ConfigureTest(STREAM_COMPACTION_TEST "${STREAM_COMPACTION_TEST_SRC}")
+
+###################################################################################################
+# - legacy stream compaction tests -----------------------------------------------------------------------

set(LEGACY_STREAM_COMPACTION_TEST_SRC
"${CMAKE_CURRENT_SOURCE_DIR}/stream_compaction/legacy/apply_boolean_mask_tests.cu"