
[BUG] strings::concatenate can overflow and cause data corruption #12087

Closed
revans2 opened this issue Nov 7, 2022 · 17 comments · Fixed by #12180
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Nov 7, 2022

Describe the bug
This is actually a generic issue with a lot of string operations. If I call strings::concatenate on string columns that would produce more data than can fit in a string column (2 GiB), I can get an overflow where cuDF tries to allocate a negative amount of memory. If the final output would be even larger, over 4 GiB, the result overflows twice: we end up allocating a too-small buffer and walking off the end of the data.

I know we don't want to introspect the data for error checks. I have tried to prototype this myself and I am no C++ expert, but I have heard that @davidwendt did something similar in another place. The idea: do the exclusive scan as an int64_t instead of an int32_t, and cast each result back to an int32_t when writing it to d_offsets. The last offset, the one we care about, would also be saved somewhere else as an int64_t; this is the length of the character buffer. We could then do size checks on it and verify it will fit before we try to allocate it.

Even if it did require an extra kernel call in the common path, there is no way for a user to detect this type of overflow ahead of time. Doing the check externally would mean two new kernel calls for everyone: one to compute the lengths just like today, and another to do a SUM over all the values as a long to see whether we would overflow. That is huge overhead compared to a hopefully small modification to the existing code. I also don't see how my suggestion would slow things down, because the int64_t would only exist within the GPU kernel; it would not be read or written, except for the very last offset.

Steps/Code to reproduce bug
Make a byte column in which every value is 10. The length should be max_value<int32_t> / 2 + 1. Convert that byte column to a string column: the result will overflow and try to allocate a negative value. If you want it to walk off the end of the buffer, use a short column with the value 1010 in it instead. This time, when converting to a string, it will write off the end of memory, and I often see "an illegal memory access was encountered".

Expected behavior
We get an exception back saying the input is too large to allocate, instead of an attempt to allocate a negative size or, even worse, a really small positive one.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Nov 7, 2022
@bdice
Contributor

bdice commented Nov 7, 2022

edit: I accidentally conflated the two types of column-wise/string-wise concatenation. See below.

The first bug I would fix is that we should raise an error if the concatenated size will exceed the limit. This is what cudf::concatenate does here:

CUDF_EXPECTS(output_size <= static_cast<std::size_t>(std::numeric_limits<size_type>::max()),
             "Total number of concatenated rows exceeds size_type range");

I wonder if this can be detected in advance without requiring a kernel call? Would a sum over the chars_size() of each column be sufficient? @davidwendt

size_type strings_column_view::chars_size() const noexcept
{
  if (size() == 0) return 0;
  return chars().size();
}

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

There would be some hurdles along the way; it is not as simple as summing up chars_size(). For example, concatenating strings with nulls results in a null, so summing chars_size() may say "overflow INT_MAX" when the actual result is just zero.

@davidwendt
Contributor

The first bug I would fix is that we should raise an error if the concatenated size will exceed the limit. This is what cudf::concatenate does here:

This is not the same concatenate. This is a horizontal concatenate of row-wise elements. There is no computation of output_size here, and adding one would require an extra kernel call.

The chars_size() would be a good estimate but may not be accurate for sliced columns. It is a good upper bound, though.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

concatenating strings with nulls will result in a null thus summing up chars_size() may say "overflow INT_MAX" but the actual result may be just zero.

I just realized that the check @bdice pointed out above actually doesn't consider nulls, so it is very conservative and not accurate. But that's fine.

@revans2
Contributor Author

revans2 commented Nov 7, 2022

To be clear, this shows up in a number of places, almost all of which could in theory overflow. strings::concatenate is one in particular that is especially problematic. But it also happens when casting a number to a string, and when lower-casing or upper-casing a string, as that can change the number of bytes without changing the number of characters. Most of the places where this could happen use make_strings_children in include/cudf/strings/detail/utilities.cuh, so if we make the change I suggested when calculating the offsets, we fix lots of potential problems all at once.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

I think the main reason people are reluctant to add such a safety check is the performance impact. We could alleviate that by adding an optional bool parameter like bound_check, so the check only executes if the user explicitly asks for it; otherwise the old fast code path runs.

@jrhemstad
Contributor

jrhemstad commented Nov 7, 2022

So we may alleviate that impact by adding an optional bool parameter like bound_check so the bound check will only execute if the user explicitly wants it.

Definitely not this. We've gone down this path before and we learned the hard way that this was a mistake. Not something we want to repeat.

I discussed this at length in #5505

@jrhemstad
Contributor

@revans2 correct me if I'm wrong, but this problem isn't fundamentally different than the problem we had with join APIs that led us to having the join APIs return the gather map in addition to adding the row_bit_count API, right?

In other words, you don't know the size of the join ahead of time and it could overflow. So we return the gather map and you can gather from the row_bit_count results and get how big the resulting column would be.

We'd need an equivalent solution, but for basically every string API.

This would be a lot of work, but one solution might be to explicitly and publicly expose the fact that every string API is implemented in two phases, i.e., phase 1 computes and returns the size of every output string, phase 2 materializes the final output string.

If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.

Please remember that we have a lot of string APIs 😏

So this approach would probably require modifying the API interface to be somewhat similar to the cub APIs, instead of having two separate interfaces for each one: a single interface per API whose output (phase 1 or phase 2) depends on the input.

@revans2
Contributor Author

revans2 commented Nov 7, 2022

If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.

We would have to expose the calculation of the length of each output string in bytes, not the offsets, because the overflow happens when calculating the offsets. That would work, but it would also preclude what happens today, where the offsets buffer is reused as intermediate storage. So at a minimum we would need to allocate a buffer for the sizes and another buffer for the offsets in a separate call.

But is there a reason why we cannot do my proposal? Do the exclusive_scan with a transform_iterator converting d_offsets to int64_t, and an output iterator that writes the results back to d_offsets as int32_t but also writes out the last value as an int64_t so we can check whether it overflowed. There would be no extra kernel calls and no extra data transfers. It should just work and be as fast as what we have today. I am not a thrust expert, so there could be something I am missing, as I have not made it work yet. But it feels like we can get the check without any extra cost.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

But is there a reason why we cannot do my proposal? Do the exclusive_scan with a transform_iterator converting the d_offsets to an int64_t and an output_iterator that would write the results out as an int32_t to d_offsets, but also for the last value would write it out as an int64_t so we could check if it overflowed. There would be no extra kernel calls, no extra data transfers.

That's also what I did for strings::repeat_strings (#8561), because overflow is too likely for that API and checking for it was unavoidable. In addition, the "phase-1" output of that API was made public, exactly the approach @jrhemstad mentioned.

@davidwendt davidwendt self-assigned this Nov 7, 2022
@davidwendt
Contributor

I'm going to try Bobby's idea and run some benchmarks.

@davidwendt
Contributor

I was not able to get thrust::exclusive_scan to accept a custom iterator.
Here is the simplest one I tried: https://godbolt.org/z/jM51nWqf7
Here is a more complex one which fails from cub: https://godbolt.org/z/dz16T6bbE

@davidwendt
Contributor

I tried an alternate approach which does an atomicAdd on the sizes as they are computed in the first pass kernel.
Here are the benchmark results:

Benchmark                                          Time         CPU  Time Old  Time New  CPU Old  CPU New
---------------------------------------------------------------------------------------------------------
StringCombine/concat/4096/16/manual_time        +0.1062     +0.0890         0         0        0        0
StringCombine/concat/4096/64/manual_time        +0.0811     +0.0713         0         0        0        0
StringCombine/concat/4096/256/manual_time       +0.0269     +0.0257         0         0        0        0
StringCombine/concat/4096/1024/manual_time      +0.0067     +0.0068         1         1        1        1
StringCombine/concat/32768/16/manual_time       +0.3137     +0.2645         0         0        0        0
StringCombine/concat/32768/64/manual_time       +0.1709     +0.1561         0         0        0        0
StringCombine/concat/32768/256/manual_time      +0.0558     +0.0542         0         1        1        1
StringCombine/concat/32768/1024/manual_time     +0.0238     +0.0237         2         2        2        2
StringCombine/concat/262144/16/manual_time      +1.1988     +1.0865         0         0        0        0
StringCombine/concat/262144/64/manual_time      +0.3000     +0.2922         1         1        1        1
StringCombine/concat/262144/256/manual_time     +0.0080     +0.0081        22        22       22       22
StringCombine/concat/262144/1024/manual_time    +0.0017     +0.0018       120       120      120      120
StringCombine/concat/2097152/16/manual_time     +2.0618     +2.0117         1         2        1        2
StringCombine/concat/2097152/64/manual_time     +0.3325     +0.3316         4         6        4        6
StringCombine/concat/2097152/256/manual_time    +0.0060     +0.0062       195       196      195      196
StringCombine/concat/16777216/16/manual_time    +2.2769     +2.2708         5        17        5       17
StringCombine/concat/16777216/64/manual_time    +0.3475     +0.3478        34        45       34       45

@revans2
Contributor Author

revans2 commented Nov 10, 2022

For me the 10 ms hit for 17 million strings is worth it. But I can see how others might not agree.

@davidwendt
Contributor

davidwendt commented Nov 14, 2022

I was able to get a custom output iterator to work with exclusive-scan so it could hold the final sum in std::size_t precision while still supporting size_type for the offsets output. Here are the benchmark results for cudf::strings::concatenate with this implementation:

Benchmark                                          Time       CPU  Time Old  Time New  CPU Old  CPU New
--------------------------------------------------------------------------------------------------------
StringCombine/concat/4096/16/manual_time        +0.0335   +0.0304         0         0        0        0
StringCombine/concat/4096/64/manual_time        +0.0172   +0.0168         0         0        0        0
StringCombine/concat/4096/256/manual_time       +0.0209   +0.0205         0         0        0        0
StringCombine/concat/4096/1024/manual_time      +0.0133   +0.0133         1         1        1        1
StringCombine/concat/32768/16/manual_time       +0.0300   +0.0267         0         0        0        0
StringCombine/concat/32768/64/manual_time       +0.0234   +0.0223         0         0        0        0
StringCombine/concat/32768/256/manual_time      +0.0046   +0.0046         0         0        0        0
StringCombine/concat/32768/1024/manual_time     -0.0083   -0.0080         1         1        1        1
StringCombine/concat/262144/16/manual_time      +0.0259   +0.0230         0         0        0        0
StringCombine/concat/262144/64/manual_time      +0.0064   +0.0064         1         1        1        1
StringCombine/concat/262144/256/manual_time     -0.0145   -0.0147        14        14       14       14
StringCombine/concat/262144/1024/manual_time    +0.0396   +0.0396        77        80       77       80
StringCombine/concat/2097152/16/manual_time     +0.0101   +0.0101         1         1        1        1
StringCombine/concat/2097152/64/manual_time     +0.0039   +0.0040         8         8        8        8
StringCombine/concat/2097152/256/manual_time    -0.0014   -0.0014       119       119      119      119
StringCombine/concat/16777216/16/manual_time    +0.0101   +0.0103         4         4        4        4
StringCombine/concat/16777216/64/manual_time    -0.0008   -0.0010        58        58       58       58

This is much better and looks promising to support overall.
(Note the baseline numbers are different than the previous comment because these runs used a different GPU)

I will work on a PR to add this to the make_strings_children utility used by concatenate and check its impact on other APIs. This will hopefully allow throwing an error for the potential overflow with minimal impact to performance.

@revans2
Contributor Author

revans2 commented Nov 15, 2022

That is really great to hear. Thanks so much for working on this.

rapids-bot bot pushed a commit that referenced this issue Dec 8, 2022
…w in offsets (#12180)

Add a new iterator that can be used with scan functions to also return the last element value at a higher precision than the scan type. This is used in the `cudf::strings::detail::make_strings_children` utility when converting output string sizes into offsets. The iterator is used with `thrust::exclusive_scan` to compute an overall result that can be checked against the maximum of `size_type`; an error is thrown if the reduction value exceeds that maximum. The iterator adds minimal overhead to save the last entry of the scan.
A custom input iterator is not required, since `thrust::exclusive_scan` uses the init parameter's type (set as 0) as the accumulator type for the scan. Values are passed to the output iterator with this type as well; the iterator simply casts the output to the scan result type and saves the last value in a separate variable.

Closes #12087

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #12180