[REVIEW] cudf strings column #2811

Merged
merged 73 commits into from
Oct 7, 2019
Changes from 69 commits
dbdb005
merge with 0.10
davidwendt Sep 9, 2019
f3a6c44
cudf strings column classes
davidwendt Sep 12, 2019
b31e398
merge with PR 2707
davidwendt Sep 16, 2019
52f774d
cudf strings column
davidwendt Sep 16, 2019
88277c0
Merge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into f…
davidwendt Sep 16, 2019
b7e5a3d
XMerge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into …
davidwendt Sep 17, 2019
559b5c8
use valid_if for creating null bitmask
davidwendt Sep 17, 2019
96b260f
fix memory leak
davidwendt Sep 17, 2019
991889f
remove unintentional nvtext changes
davidwendt Sep 17, 2019
6482aff
added strings array.cu and attributes.cu source
davidwendt Sep 18, 2019
8e881d9
add child column create utility
davidwendt Sep 19, 2019
425c81f
strings_column_handler to strings_column_view
davidwendt Sep 19, 2019
ed450f9
string_view iterator and code-points fn
davidwendt Sep 20, 2019
c4af499
added is_ and split api declarations
davidwendt Sep 20, 2019
85b3150
added a concatenate function
davidwendt Sep 23, 2019
5edc965
create from offsets factory method
davidwendt Sep 24, 2019
83ee6be
fixed missing stream, mr parms
davidwendt Sep 24, 2019
0a3ca5d
create_offsets method
davidwendt Sep 24, 2019
da26320
first strings columns gtests
davidwendt Sep 25, 2019
56bcf15
Merge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into f…
davidwendt Sep 25, 2019
69e43bd
gtests for strings column attributes
davidwendt Sep 25, 2019
6600614
finish gtests for attributes
davidwendt Sep 25, 2019
80d0c1c
array gtests
davidwendt Sep 25, 2019
3845b5c
merge with 2207
davidwendt Sep 25, 2019
0a39ea2
Merge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into f…
davidwendt Sep 26, 2019
cfd9bbc
finish gtests
davidwendt Sep 26, 2019
ad1930d
Merge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into f…
davidwendt Sep 27, 2019
1868748
Merge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into f…
davidwendt Sep 30, 2019
02a00c8
improve gather scan call
davidwendt Sep 30, 2019
c9fb410
some minor fixes per review
davidwendt Sep 30, 2019
892d7a7
refactored counts methods; fixed typos per review
davidwendt Sep 30, 2019
05ca768
remove unused var from lambda
davidwendt Sep 30, 2019
5043c7b
rename utilities.h to utilities.hpp
davidwendt Oct 1, 2019
587f18f
Merge commit 'refs/pull/2207/head' of github.com:rapidsai/cudf into f…
davidwendt Oct 1, 2019
ce9d417
update changelog
davidwendt Oct 1, 2019
60bda84
offsets +1
davidwendt Oct 1, 2019
e64a423
merge changelog
davidwendt Oct 1, 2019
8e2beec
add comments, fix variable names
davidwendt Oct 1, 2019
e8424e4
characters() -> length()
davidwendt Oct 1, 2019
7b245a1
missed copyright comment
davidwendt Oct 1, 2019
a62e0b7
Merge branch 'branch-0.10' into fea-ext-column-redesign
davidwendt Oct 2, 2019
7049704
size() to size_bytes()
davidwendt Oct 2, 2019
f601bef
use cudaMemset
davidwendt Oct 2, 2019
fd9b289
use cudf sort enums for sort()
davidwendt Oct 2, 2019
eeafe06
too many files to review
davidwendt Oct 2, 2019
0d0c66b
remove more files
davidwendt Oct 2, 2019
bae74aa
add missing mutable_column_device_view methods
davidwendt Oct 2, 2019
ec9e6ea
updates per PR review
davidwendt Oct 2, 2019
aafa9ab
fixed some documentation
davidwendt Oct 2, 2019
1750ea0
moved strings factories header to column_factories.hpp
davidwendt Oct 3, 2019
7870cd5
Merge branch 'branch-0.10' into fea-ext-column-redesign
davidwendt Oct 3, 2019
4296ecb
strings_column_view private inherit from column_view
davidwendt Oct 3, 2019
9be11c5
move string_view to cudf namespace
davidwendt Oct 3, 2019
f8c08ac
improve comments for table_device_view
davidwendt Oct 3, 2019
78e31ac
update comments for table_device_view
davidwendt Oct 3, 2019
55b3c6d
Add forward decl for string_view.
jrhemstad Oct 3, 2019
663ad06
Add string_view to STRING mapping.
jrhemstad Oct 3, 2019
04bc439
Specialize size_of for fixed-width only.
jrhemstad Oct 3, 2019
dc4bc7a
Update traits to use named function objects.
jrhemstad Oct 3, 2019
1a726c0
Merge remote-tracking branch 'david/fea-ext-column-redesign' into fea…
jrhemstad Oct 3, 2019
e5f1c82
Add string_view to traits.
jrhemstad Oct 3, 2019
9e51eae
Merge pull request #1 from jrhemstad/fea-ext-add-string-to-type-dispa…
davidwendt Oct 3, 2019
caaadea
Update column_view constructor to use is_compound.
jrhemstad Oct 3, 2019
825099c
Remove uneccessary include.
jrhemstad Oct 3, 2019
b80153f
use is_compound() in column_view_base
davidwendt Oct 3, 2019
6b3adb1
Remove noexcept for size_of_helper degerenate case.
jrhemstad Oct 3, 2019
5095814
Merge branch 'fea-ext-add-string-to-type-dispatcher' into pr/davidwen…
jrhemstad Oct 3, 2019
d641108
remove noexcept
davidwendt Oct 3, 2019
96b6e2e
Merge branch 'fea-ext-column-redesign' of github.com:davidwendt/cudf …
jrhemstad Oct 3, 2019
2c2a8d7
Merge branch 'branch-0.10' into fea-ext-column-redesign
davidwendt Oct 4, 2019
abd26a3
check for valid null mask
davidwendt Oct 4, 2019
2d19ea6
fixed more typos per review
davidwendt Oct 4, 2019
cda8b99
use cudf::test::BaseFixture instead of GdfTest
davidwendt Oct 4, 2019
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -38,6 +38,7 @@
- PR #2838 CSV Reader: Support ARROW_RANDOM_FILE input
- PR #2655 CuPy-based Series and Dataframe .values property
- PR #2803 Added `edit_distance_matrix()` function to calculate pairwise edit distance for each string on a given nvstrings object.
- PR #2811 Start of cudf strings column work based on 2207
- PR #2872 Add Java pinned memory pool allocator

## Improvements
14 changes: 8 additions & 6 deletions cpp/CMakeLists.txt
@@ -324,7 +324,7 @@ add_library(cudf
src/column/legacy/column.cpp
src/column/legacy/context.cpp
src/table/legacy/table.cpp
src/string/nvcategory_util.cpp
src/strings/nvcategory_util.cpp
src/join/joining.cu
src/orderby/orderby.cu
src/predicates/is_sorted.cu
@@ -432,13 +432,10 @@ add_library(cudf
src/table/table.cpp
src/bitmask/null_mask.cu
src/sort/sort.cu
src/strings/strings_column_factories.cu
src/strings/strings_column_view.cu
src/column/legacy/interop.cpp)

# Override RPATH for nvstrings
set_target_properties(libNVStrings PROPERTIES BUILD_RPATH "\$ORIGIN")
set_target_properties(libNVCategory PROPERTIES BUILD_RPATH "\$ORIGIN")
set_target_properties(libNVText PROPERTIES BUILD_RPATH "\$ORIGIN")

# Rename installation to proper names for later finding
set_target_properties(libNVStrings PROPERTIES OUTPUT_NAME "NVStrings")
set_target_properties(libNVCategory PROPERTIES OUTPUT_NAME "NVCategory")
@@ -447,6 +444,11 @@ set_target_properties(libNVText PROPERTIES OUTPUT_NAME "NVText")
# Override RPATH for cudf
set_target_properties(cudf PROPERTIES BUILD_RPATH "\$ORIGIN")

# Override RPATH for nvstrings
set_target_properties(libNVStrings PROPERTIES BUILD_RPATH "\$ORIGIN")
set_target_properties(libNVCategory PROPERTIES BUILD_RPATH "\$ORIGIN")
set_target_properties(libNVText PROPERTIES BUILD_RPATH "\$ORIGIN")

###################################################################################################
# - jitify ----------------------------------------------------------------------------------------

122 changes: 78 additions & 44 deletions cpp/include/cudf/column/column_device_view.cuh
@@ -18,6 +18,8 @@
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/bit.cuh>
#include <cudf/strings/string_view.cuh>
#include <cudf/strings/strings_column_view.hpp>

namespace cudf {

@@ -70,34 +72,6 @@ class alignas(16) column_device_view_base {
return head<T>() + _offset;
}

/**---------------------------------------------------------------------------*
* @brief Returns reference to element at the specified index.
*
* This function accounts for the offset.
*
* @tparam T The element type
* @param element_index Position of the desired element
*---------------------------------------------------------------------------**/
template <typename T>
__device__ T const& element(size_type element_index) const noexcept {
return data<T>()[element_index];
}

/**---------------------------------------------------------------------------*
* @brief Returns `string_view` to the string element at the specified index.
*
* This function accounts for the offset.
*
* @param element_index Position of the desired string
*---------------------------------------------------------------------------**/
/*
template <>
__device__ string_view const& element<string_view>(
size_type element_index) const noexcept {
// Fill this in
}
*/

/**---------------------------------------------------------------------------*
* @brief Returns the number of elements in the column
*---------------------------------------------------------------------------**/
@@ -234,6 +208,32 @@ class alignas(16) column_device_view : public detail::column_device_view_base {
column_device_view& operator=(column_device_view const&) = default;
column_device_view& operator=(column_device_view&&) = default;

/**---------------------------------------------------------------------------*
* @brief Creates an instance of this class in the specified host memory
* using the device memory pointer as a base for child pointers.
*
* @param column Column view from which to create this instance.
* @param h_ptr Host memory pointer on which to place any child data.
* @param d_ptr Device memory pointer on which to base any child pointers.
*---------------------------------------------------------------------------**/
column_device_view( column_view column, ptrdiff_t h_ptr, ptrdiff_t d_ptr );

/**---------------------------------------------------------------------------*
* @brief Returns reference to element at the specified index.
*
* If the element at the specified index is NULL, i.e., `is_null(element_index) == true`,
* then any attempt to use the result will lead to undefined behavior.
*
* This function accounts for the offset.
*
* @tparam T The element type
* @param element_index Position of the desired element
*---------------------------------------------------------------------------**/
template <typename T>
__device__ T const element(size_type element_index) const noexcept {
return data<T>()[element_index];
}

/**---------------------------------------------------------------------------*
* @brief Factory to construct a column view that is usable in device memory.
*
@@ -265,6 +265,14 @@ class alignas(16) column_device_view : public detail::column_device_view_base {
*---------------------------------------------------------------------------**/
void destroy();

/**---------------------------------------------------------------------------*
* @brief Return the size in bytes of the amount of memory needed to hold a
* device view of the specified column and its children.
*
* @param source_view The `column_view` to use for this calculation.
*---------------------------------------------------------------------------**/
static size_type extent(column_view source_view);

/**---------------------------------------------------------------------------*
* @brief Returns the specified child
*
@@ -308,6 +316,16 @@ class alignas(16) mutable_column_device_view
default;
mutable_column_device_view& operator=(mutable_column_device_view&&) = default;

/**---------------------------------------------------------------------------*
* @brief Creates an instance of this class in the specified host memory
* using the device memory pointer as a base for child pointers.
*
* @param column Column view from which to create this instance.
* @param h_ptr Host memory pointer on which to place any child data.
* @param d_ptr Device memory pointer on which to base any child pointers.
*---------------------------------------------------------------------------**/
mutable_column_device_view( mutable_column_view column, ptrdiff_t h_ptr, ptrdiff_t d_ptr );

/**---------------------------------------------------------------------------*
* @brief Factory to construct a column view that is usable in device memory.
*
@@ -327,7 +345,8 @@ class alignas(16) mutable_column_device_view
* @return A `unique_ptr` to a `mutable_column_device_view` that makes the
*data from `source_view` available in device memory.
*---------------------------------------------------------------------------**/
static auto create(mutable_column_view source_view, cudaStream_t stream = 0);
static std::unique_ptr<mutable_column_device_view, std::function<void(mutable_column_device_view*)>>
create(mutable_column_view source_view, cudaStream_t stream = 0);

/**---------------------------------------------------------------------------*
* @brief Returns pointer to the base device memory allocation casted to
@@ -376,21 +395,6 @@ class alignas(16) mutable_column_device_view
return data<T>()[element_index];
}

/**---------------------------------------------------------------------------*
* @brief Returns `string_view` to the string element at the specified index.
*
* This function accounts for the offset.
*
* @param element_index Position of the desired string
*---------------------------------------------------------------------------**/
/*
template <>
__device__ string_view& element<string_view>(
size_type element_index) noexcept {
// Fill this in
}
*/

/**---------------------------------------------------------------------------*
* @brief Returns raw pointer to the underlying bitmask allocation.
*
@@ -454,6 +458,14 @@ class alignas(16) mutable_column_device_view
null_mask()[word_index] = new_word;
}

/**---------------------------------------------------------------------------*
* @brief Return the size in bytes of the amount of memory needed to hold a
* device view of the specified column and its children.
*
* @param source_view The `mutable_column_view` to use for this calculation.
*---------------------------------------------------------------------------**/
static size_type extent(mutable_column_view source_view);

private:
mutable_column_device_view*
mutable_children{}; ///< Array of `mutable_column_device_view`
@@ -479,6 +491,28 @@ class alignas(16) mutable_column_device_view
* allocated to hold the child views.
*---------------------------------------------------------------------------**/
void destroy();

};

/**---------------------------------------------------------------------------*
* @brief Returns `string_view` to the string element at the specified index.
*
* If the element at the specified index is NULL, i.e., `is_null(element_index) == true`,
* then any attempt to use the result will lead to undefined behavior.
*
* This function accounts for the offset.
*
* @param element_index Position of the desired string element
* @return string_view instance representing this element at this index
*---------------------------------------------------------------------------**/
template <>
__device__ inline string_view const column_device_view::element<string_view>(
size_type element_index) const noexcept {
size_type index = element_index + offset(); // account for this view's _offset
const int32_t* d_offsets = d_children[strings_column_view::offsets_column_index].data<int32_t>();
const char* d_strings = d_children[strings_column_view::chars_column_index].data<char>();
size_type offset = d_offsets[index];
return string_view{d_strings + offset, d_offsets[index+1] - offset};
}

} // namespace cudf
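Note (illustration only, not part of the PR): the `element<string_view>` specialization above finds a string by reading two adjacent entries of the offsets child and slicing the chars child between them. The host-only C++ sketch below mimics that arithmetic so the indexing can be checked without a GPU. `host_strings`, `make_host_strings`, and this free function `element` are hypothetical names, and the offsets are assumed to be 32-bit byte offsets, as the `int32_t` reads in the diff suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side stand-in for a strings column (not a cudf type):
// one flat buffer of UTF-8 bytes plus N+1 byte offsets for N strings.
struct host_strings {
  std::vector<char> chars;
  std::vector<std::int32_t> offsets;
  std::int32_t view_offset;  // index of the first row visible through this view
};

// Pack the strings back to back and record where each one ends.
inline host_strings make_host_strings(std::vector<std::string> const& strings) {
  host_strings col;
  col.view_offset = 0;
  col.offsets.push_back(0);
  for (std::string const& s : strings) {
    col.chars.insert(col.chars.end(), s.begin(), s.end());
    col.offsets.push_back(static_cast<std::int32_t>(col.chars.size()));
  }
  return col;
}

// Same arithmetic as the device specialization: shift by the view offset,
// then use offsets[index] and offsets[index + 1] as the byte range.
inline std::string element(host_strings const& col, std::int32_t element_index) {
  std::int32_t const index = element_index + col.view_offset;
  assert(index >= 0 && static_cast<std::size_t>(index) + 1 < col.offsets.size());
  std::int32_t const begin = col.offsets[index];
  std::int32_t const length = col.offsets[index + 1] - begin;
  return std::string(col.chars.begin() + begin, col.chars.begin() + begin + length);
}
```

Setting `view_offset` to 1 makes `element(col, 1)` return the third packed string, which is the role `offset()` plays in the device code. Lengths are in bytes, so a multi-byte UTF-8 character counts more than once.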
66 changes: 66 additions & 0 deletions cpp/include/cudf/column/column_factories.hpp
@@ -19,6 +19,8 @@
#include <cudf/types.hpp>
#include "column.hpp"

#include <rmm/thrust_rmm_allocator.h>

namespace cudf {
/**---------------------------------------------------------------------------*
* @brief Construct column with sufficient uninitialized storage
@@ -43,4 +45,68 @@ std::unique_ptr<column> make_numeric_column(
data_type type, size_type size, mask_state state = UNALLOCATED,
cudaStream_t stream = 0,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

/**---------------------------------------------------------------------------*
* @brief Construct STRING type column given an array of pointer/size pairs.
* The total number of char bytes must not exceed the maximum size of size_type.
* The string characters are expected to be UTF-8 encoded sequence of char bytes.
* Use the strings_column_view class to perform strings operations on this type
* of column.
*
* @note `null_count()` and `null_bitmask` are determined if a pair contains
* a null string. That is, for each pair, if `.first` is null, that string
* is considered null. Likewise, a string is considered empty (not null)
* if `.first` is not null and `.second` is 0. Otherwise the `.first` member
* must be a valid device address pointing to `.second` consecutive bytes.
*
* @throws std::bad_alloc if device memory allocation fails
*
* @param strings The pointer/size pair arrays.
* Each pointer must be a device memory address or `nullptr` (indicating a null string).
* The size must be the number of bytes.
* @param stream Optional stream for use with all memory allocation
* and device kernels
* @param mr Optional resource to use for device memory
* allocation of the column's `null_mask` and children.
*---------------------------------------------------------------------------**/
std::unique_ptr<column> make_strings_column(
const rmm::device_vector<thrust::pair<const char*,size_type>>& strings,
cudaStream_t stream = 0,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());
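
Note (illustration only, not part of the PR): the `@note` above distinguishes null strings from empty ones by the pointer, not by the size. The host-only sketch below applies that rule to plain `std::pair` values and also checks the stated limit on total bytes. `string_pair`, `entry_kind`, `classify`, `pair_summary`, and `summarize` are hypothetical names, and `size_type` is assumed to be a 32-bit signed integer.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Host-side stand-in for thrust::pair<const char*, size_type>.
// Assumption: size_type is a 32-bit signed integer.
typedef std::pair<const char*, std::int32_t> string_pair;

enum class entry_kind { null_string, empty_string, has_bytes };

// Rule from the @note: a null pointer marks a null string; a non-null
// pointer with size 0 marks an empty (but valid) string.
inline entry_kind classify(string_pair const& p) {
  if (p.first == nullptr) return entry_kind::null_string;
  assert(p.second >= 0);
  return p.second == 0 ? entry_kind::empty_string : entry_kind::has_bytes;
}

struct pair_summary {
  std::int32_t null_count;
  std::int64_t total_bytes;
  bool fits_in_size_type;
};

// Count nulls and sum byte sizes in 64 bits so the limit check cannot overflow.
inline pair_summary summarize(std::vector<string_pair> const& pairs) {
  pair_summary s;
  s.null_count = 0;
  s.total_bytes = 0;
  for (string_pair const& p : pairs) {
    if (classify(p) == entry_kind::null_string) {
      ++s.null_count;  // the size of a null entry is ignored
    } else {
      s.total_bytes += p.second;
    }
  }
  s.fits_in_size_type = s.total_bytes <= std::numeric_limits<std::int32_t>::max();
  return s;
}
```

The real factory takes device pointers in an `rmm::device_vector`; this sketch only models the classification rule and the size constraint on the host.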

/**---------------------------------------------------------------------------*
* @brief Construct STRING type column given a contiguous array of chars
* encoded as UTF-8, an array of byte offsets identifying individual strings
* within the char array, and a null bitmask.
* The total number of char bytes must not exceed the maximum size of size_type.
* Use the strings_column_view class to perform strings operations on this type
* of column.
* This function makes a deep copy of the strings, offsets, null_mask to create
* a new column.
*
* @throws std::bad_alloc if device memory allocation fails
*
* @param strings The contiguous array of chars in device memory.
* This char array is expected to be UTF-8 encoded characters.
* @param offsets The array of byte offsets in device memory.
* The number of elements is one more than the total number
* of strings so the offset[last] - offset[0] is the total
* number of bytes in the strings array.
* @param null_mask The array of bits specifying the null strings.
* This array must be in device memory.
* Arrow format for nulls is used for interpreting this bitmask.
* @param null_count The number of null string entries.
* @param stream Optional stream for use with all memory allocation
* and device kernels
* @param mr Optional resource to use for device memory
* allocation of the column's `null_mask` and children.
*---------------------------------------------------------------------------**/
std::unique_ptr<column> make_strings_column(
const rmm::device_vector<char>& strings,
const rmm::device_vector<size_type>& offsets,
const rmm::device_vector<bitmask_type>& null_mask,
size_type null_count,
cudaStream_t stream = 0,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

} // namespace cudf
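Note (illustration only, not part of the PR): the second `make_strings_column` overload takes N+1 byte offsets and an Arrow-format null bitmask. The host-only sketch below shows how those two inputs are read. In the Arrow format, bits are numbered from the least significant bit of each word and a set bit means the row is valid. `mask_word`, `is_valid`, `count_nulls`, and `total_char_bytes` are hypothetical names, and 32-bit mask words are an assumption about `bitmask_type`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: the bitmask is stored in 32-bit words.
typedef std::uint32_t mask_word;

// Arrow validity bitmap: bit (row % 32) of word (row / 32), counted from the
// least significant bit. A set bit means valid, a clear bit means null.
inline bool is_valid(std::vector<mask_word> const& null_mask, std::size_t row) {
  assert(row / 32 < null_mask.size());
  return ((null_mask[row / 32] >> (row % 32)) & 1u) != 0;
}

// Count the rows whose validity bit is clear.
inline std::int32_t count_nulls(std::vector<mask_word> const& null_mask,
                                std::size_t num_rows) {
  std::int32_t nulls = 0;
  for (std::size_t row = 0; row < num_rows; ++row) {
    if (!is_valid(null_mask, row)) ++nulls;
  }
  return nulls;
}

// With N+1 offsets for N strings, last minus first is the total byte count.
inline std::int32_t total_char_bytes(std::vector<std::int32_t> const& offsets) {
  assert(!offsets.empty());
  return offsets.back() - offsets.front();
}
```

For offsets `{0, 5, 5, 8}` and mask word `0x5`, there are three rows, the middle one is null, and the chars array holds 8 bytes. A null row still has an offsets entry, here with zero length.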