Add strings::repeat_strings API that can repeat each string a different number of times #8561

Merged
merged 37 commits into from
Jul 20, 2021
Changes from 32 commits
37 commits
f0b5ff7
Add doxygen
ttnghia Jun 17, 2021
a61c1e7
Finish implementation
ttnghia Jun 18, 2021
2639f9a
Finish unit tests
ttnghia Jun 18, 2021
dbbfbf9
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jun 18, 2021
0dec873
Fix merge conflicts
ttnghia Jun 18, 2021
143f853
Rename parameter back to `input`
ttnghia Jun 21, 2021
b39bb06
Fix typo
ttnghia Jun 21, 2021
ae40591
Rewrite using type_dispatcher for different integer types
ttnghia Jun 21, 2021
8612f61
Fix comment typo
ttnghia Jun 21, 2021
3dfec42
Remove input check for int32_t data type
ttnghia Jun 21, 2021
d230498
Remove bool type from the expecting types for `repeat_times` data type
ttnghia Jun 21, 2021
f534372
Implement overflow check for the new API, as it can't be done outside…
ttnghia Jun 21, 2021
554d20d
Update doxygen
ttnghia Jun 21, 2021
d00ba01
Add typed tests for various types of `repeat_times` column
ttnghia Jun 21, 2021
5b5c2a4
Fix doxygen
ttnghia Jun 21, 2021
d488eca
Simplify overflow checking
ttnghia Jun 21, 2021
8498fc6
Just re-order code
ttnghia Jun 21, 2021
855a774
Add a parameter to allow turning on/off overflow checking
ttnghia Jun 24, 2021
e5f5db8
Implement overflow checking
ttnghia Jun 25, 2021
50f05fd
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jun 25, 2021
0bcf8d8
Redesign the API and update doxygen
ttnghia Jun 25, 2021
c7b7c3b
Add an optional column of pre-computed output strings offsets
ttnghia Jul 7, 2021
f6d7ee3
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jul 8, 2021
d22c4e5
Finish implementation
ttnghia Jul 8, 2021
6124e83
Fix JNI
ttnghia Jul 8, 2021
90517aa
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jul 8, 2021
7795f54
Cleanup
ttnghia Jul 8, 2021
9158e4f
Remove duplicate code
ttnghia Jul 9, 2021
5e37782
Add test for computing string output sizes that causes overflow
ttnghia Jul 9, 2021
4dfca75
Fix test build error
ttnghia Jul 9, 2021
95cb6c0
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jul 9, 2021
91c414e
Simple fix comment typo
ttnghia Jul 9, 2021
00ad095
Address review comments, fixing doxygen and some code improvements
ttnghia Jul 19, 2021
b7956e4
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jul 19, 2021
bebb9e6
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jul 19, 2021
5c554c1
Cleanup header
ttnghia Jul 20, 2021
b294f52
Merge branch 'branch-21.08' into repeat_strings
ttnghia Jul 20, 2021
112 changes: 98 additions & 14 deletions cpp/include/cudf/strings/repeat_strings.hpp
@@ -30,10 +30,13 @@ namespace strings {
/**
* @brief Repeat the given string scalar by a given number of times.
*
* For a given string scalar, an output string scalar is generated by repeating the input string by
* a number of times given by the @p `repeat_times` parameter. If `repeat_times` is not a positive
* value, an empty (valid) string scalar will be returned. An invalid input scalar will always
* result in an invalid output scalar regardless of the value of `repeat_times` parameter.
* An output string scalar is generated by repeating the input string by a number of times given by
* the @p `repeat_times` parameter.
*
* In special cases:
* - If @p `repeat_times` is not a positive value, an empty (valid) string scalar will be returned.
* - An invalid input scalar will always result in an invalid output scalar regardless of the
* value of @p `repeat_times` parameter.
*
* @code{.pseudo}
* Example:
@@ -47,26 +50,31 @@ namespace strings {
* (i.e., `input.size() * repeat_times > numeric_limits<size_type>::max()`).
*
* @param input The scalar containing the string to repeat.
* @param repeat_times The number of times the `input` string is copied to the output.
* @param repeat_times The number of times the input string is repeated.
* @param mr Device memory resource used to allocate the returned string scalar.
* @return New string scalar in which the string is repeated from the input.
* @return New string scalar in which the input string is repeated.
*/
std::unique_ptr<string_scalar> repeat_strings(
std::unique_ptr<string_scalar> repeat_string(
string_scalar const& input,
size_type repeat_times,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Repeat each string in the given strings column by a given number of times.
*
* For a given strings column, an output strings column is generated by repeating each string from
* the input by a number of times given by the @p `repeat_times` parameter. If `repeat_times` is not
* a positive value, all the rows of the output strings column will be an empty string. Any null row
* will result in a null row regardless of the value of `repeat_times` parameter.
* An output strings column is generated by repeating each string from the input strings column by a
* number of times given by the @p `repeat_times` parameter.
*
* In special cases:
* - If @p `repeat_times` is not a positive number, a non-null input string will always result in
* an empty output string.
* - A null input string will always result in a null output string regardless of the value of the
* @p `repeat_times` parameter.
*
* Note that this function cannot handle the cases when the size of the output column exceeds the
* maximum value that can be indexed by size_type (offset_type). In such situations, an exception
* may be thrown, or the output result is undefined.
* may be thrown, or the output result is undefined. As such, the caller is responsible for checking
* output overflow to prevent runtime exception and data corruption.
*
* @code{.pseudo}
* Example:
@@ -76,15 +84,91 @@ std::unique_ptr<string_scalar> repeat_strings(
* @endcode
*
* @param input The column containing strings to repeat.
* @param repeat_times The number of times each input string is copied to the output.
* @param repeat_times The number of times each input string is repeated.
* @param mr Device memory resource used to allocate the returned strings column.
* @return New column with concatenated results.
* @return New column containing the repeated strings.
*/
std::unique_ptr<column> repeat_strings(
strings_column_view const& input,
size_type repeat_times,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Repeat each string in the given strings column by the numbers of times given in another
* numeric column.
*
* An output strings column is generated by repeating each input string the number of times given
* by the corresponding row in the @p `repeat_times` numeric column. Computation time can be
* reduced if the sizes of the output strings are known and provided.
*
* In special cases:
* - Any null row (from either the input strings column or the `repeat_times` column) will always
* result in a null output string.
* - If any value in the `repeat_times` column is not a positive number and its corresponding input
* string is not null, the output string will be an empty string.
*
* Note that this function cannot handle the cases when the size of the output column exceeds the
* maximum value that can be indexed by size_type (offset_type). In such situations, an exception
* may be thrown, or the output result is undefined. As such, the caller is responsible for checking
* output overflow to prevent runtime exception and data corruption.
*
* @code{.pseudo}
* Example:
* strs = ['aa', null, '', 'bbc-']
* repeat_times = [ 1, 2, 3, 4 ]
* out = repeat_strings(strs, repeat_times)
* out is ['aa', null, '', 'bbc-bbc-bbc-bbc-']
* @endcode
*
* @throw cudf::logic_error if the input `repeat_times` column has data type other than integer.
* @throw cudf::logic_error if the input columns have different sizes.
*
* @param input The column containing strings to repeat.
* @param repeat_times The column containing numbers of times that the corresponding input strings
* are repeated.
* @param output_strings_sizes The optional column containing pre-computed sizes of the output
* strings.
* @param mr Device memory resource used to allocate the returned strings column.
* @return New column containing the repeated strings.
*/
std::unique_ptr<column> repeat_strings(
strings_column_view const& input,
column_view const& repeat_times,
std::optional<column_view> output_strings_sizes = std::nullopt,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Compute sizes of the output strings if each string in the input strings column
* is repeated by the numbers of times given in another numeric column.
*
* While the output string sizes are computed, they are also summed (into an int64_t number) and
* the total is returned. The caller can use it to detect whether the input strings column can be
* safely repeated without data corruption due to overflow in string indexing.
*
* @code{.pseudo}
* Example:
* strs = ['aa', null, '', 'bbc-']
* repeat_times = [ 1, 2, 3, 4 ]
* out = repeat_strings_output_sizes(strs, repeat_times)
* out is [2, 0, 0, 16]
* @endcode
*
* @throw cudf::logic_error if the input `repeat_times` column has data type other than integer.
* @throw cudf::logic_error if the input columns have different sizes.
*
* @param input The column containing strings to repeat.
* @param repeat_times The column containing numbers of times that the corresponding input strings
* are repeated.
* @param mr Device memory resource used to allocate the returned column.
* @return A pair whose first item is an int32_t column containing the sizes of the output strings,
* and whose second item is an int64_t number containing the total size (in bytes) of the
* output strings column.
*/
std::pair<std::unique_ptr<column>, int64_t> repeat_strings_output_sizes(
strings_column_view const& input,
column_view const& repeat_times,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
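
The doc comments above leave overflow checking to the caller. The following is a host-side sketch of the documented row semantics and of that check. It is not the libcudf implementation. It assumes that `size_type` is a 32-bit signed integer and models a null row with `std::nullopt`. The names `row_t`, `count_t`, `output_sizes_model`, `repeat_strings_model`, and `checked_repeat_model` are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side stand-ins (not cudf types): std::nullopt models a null row.
using row_t   = std::optional<std::string>;
using count_t = std::optional<std::int32_t>;

// Assumption of this sketch: size_type is a 32-bit signed integer.
constexpr std::int64_t max_size_type = std::numeric_limits<std::int32_t>::max();

// Models repeat_strings_output_sizes: per-row sizes (0 for null rows and for
// non-positive counts) plus the total size accumulated in 64 bits.
std::pair<std::vector<std::int32_t>, std::int64_t> output_sizes_model(
  std::vector<row_t> const& input, std::vector<count_t> const& repeat_times)
{
  if (input.size() != repeat_times.size()) {
    throw std::invalid_argument("input and repeat_times must have the same size");
  }
  std::vector<std::int32_t> sizes(input.size(), 0);
  std::int64_t total = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (!input[i] || !repeat_times[i] || *repeat_times[i] <= 0) { continue; }
    auto const size = static_cast<std::int64_t>(input[i]->size()) * *repeat_times[i];
    // Sketch-only choice: saturate a row size that does not fit in int32_t.
    // The 64-bit total still exceeds the limit, so the caller's check catches it.
    sizes[i] = static_cast<std::int32_t>(std::min(size, max_size_type));
    total += size;
  }
  return {std::move(sizes), total};
}

// Models repeat_strings(input, repeat_times) with no overflow check:
// a null in either column gives a null row; a non-positive count gives "".
std::vector<row_t> repeat_strings_model(std::vector<row_t> const& input,
                                        std::vector<count_t> const& repeat_times)
{
  assert(input.size() == repeat_times.size());
  std::vector<row_t> out(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (!input[i] || !repeat_times[i]) { continue; }  // output row stays null
    std::string repeated;
    for (std::int32_t n = 0; n < *repeat_times[i]; ++n) { repeated += *input[i]; }
    out[i] = std::move(repeated);
  }
  return out;
}

// Caller-side pattern: compute the sizes first, reject inputs whose total
// output would not be indexable by size_type, and only then repeat.
std::vector<row_t> checked_repeat_model(std::vector<row_t> const& input,
                                        std::vector<count_t> const& repeat_times)
{
  auto const sizes_and_total = output_sizes_model(input, repeat_times);
  if (sizes_and_total.second > max_size_type) {
    throw std::overflow_error("total output size exceeds the size_type range");
  }
  return repeat_strings_model(input, repeat_times);
}
```

With the example from the doc comments (`strs = ['aa', null, '', 'bbc-']`, `repeat_times = [1, 2, 3, 4]`), this model yields sizes `[2, 0, 0, 16]` and a total of 18 bytes, so the check passes. Accumulating the total in 64 bits is what allows the comparison against the 32-bit limit to be made without the sum itself wrapping around.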