[BUG] strings::concatenate can overflow and cause data corruption #12087
Comments
edit: I accidentally conflated the two types of column-wise/string-wise concatenation. See below.
cudf/cpp/src/copying/concatenate.cu, lines 232-233 (at commit 52dbb63)
cudf/cpp/src/strings/strings_column_view.cpp, lines 51-55 (at commit 52dbb63)
There would be some hurdles along the way, not just simply summing up …
This is not the same concatenate. This is a horizontal concatenate of row-wise elements. There is no computation of the …
I just realized that the check @bdice pointed out above actually doesn't consider nulls, thus it is very conservative and also not accurate. But that's fine.
To be clear, this shows up in a number of places, almost all of which, in theory, could overflow. strings::concatenate in particular is especially problematic. But it also happens when casting to a string from a number. It happens when calculating lowercase or uppercase for a string, as that can change the number of bytes without changing the number of characters. There are many different places this could happen; most of them use …
I think the main reason people are reluctant to add such a safe bounds check is the performance impact. So we may alleviate that impact by adding an optional bool parameter like …
Definitely not this. We've gone down this path before and we learned the hard way that this was a mistake. Not something we want to repeat. I discussed this at length in #5505.
@revans2 correct me if I'm wrong, but this problem isn't fundamentally different from the problem we had with …. In other words, you don't know the size of the join ahead of time and it could overflow. So we return the gather map and you can …. We'd need an equivalent solution, but for basically every string API. This would be a lot of work, but one solution might be to explicitly and publicly expose the fact that every string API is implemented in two phases, i.e., phase 1 computes and returns the size of every output string, and phase 2 materializes the final output string. If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.
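As an illustration of the caller-side check such a phase-1 API would enable, here is a minimal sketch; `check_output_fits` and the device vector of per-row sizes are assumptions for the example, not an existing libcudf interface:

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical caller-side check over the per-row output byte sizes that a
// "phase 1" API would return. Reducing with an int64_t init keeps the sum
// itself from overflowing.
void check_output_fits(thrust::device_vector<int32_t> const& sizes)
{
  int64_t total = thrust::reduce(sizes.begin(), sizes.end(), int64_t{0});
  if (total > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error("output exceeds the 2 GiB strings column limit");
  }
}
```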
Please remember that we have a lot of string APIs 😏 So probably this approach would require modifying the API interface to be somewhat similar to …
We would have to expose the calculation of the length of each output string in bytes, not the offsets, because the overflow happens when calculating the offsets. That would work, but it would also preclude what happens today, where the offsets get reused as an intermediate storage location. So at a minimum we would need to allocate a buffer for the sizes and another buffer for the offsets in a different call. But is there a reason why we cannot do my proposal? Do the …
That's also what I did for …
I'm going to try Bobby's idea and run some benchmarks.
I was not able to get …
I tried an alternate approach which does an …
For me the 10 ms hit for 17 million strings is worth it. But I can see how others might not agree.
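The comment above is truncated, so the exact alternate approach isn't recorded here. Purely as a hedged illustration, an extra-pass variant that scans the sizes in `int64_t` into a temporary buffer before casting down might look like the following (`sizes_to_offsets` is a name invented for this sketch):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>

#include <cstdint>
#include <limits>
#include <stdexcept>

// Casting functor; avoids requiring --extended-lambda for a device lambda.
struct cast_to_int32 {
  __host__ __device__ int32_t operator()(int64_t v) const
  {
    return static_cast<int32_t>(v);
  }
};

// Assumes a non-empty sizes vector.
thrust::device_vector<int32_t> sizes_to_offsets(
  thrust::device_vector<int32_t> const& sizes)
{
  auto const n = sizes.size();
  // Scan in int64_t: thrust uses the init parameter's type as the accumulator.
  thrust::device_vector<int64_t> wide(n + 1);
  thrust::exclusive_scan(sizes.begin(), sizes.end(), wide.begin(), int64_t{0});
  // Total bytes = last exclusive offset + last size (two single-element copies).
  int64_t total = static_cast<int64_t>(wide[n - 1]) + sizes[n - 1];
  if (total > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error("strings column size exceeds 2 GiB");
  }
  wide[n] = total;
  // The extra pass: cast the wide offsets down to int32_t.
  thrust::device_vector<int32_t> offsets(n + 1);
  thrust::transform(wide.begin(), wide.end(), offsets.begin(), cast_to_int32{});
  return offsets;
}
```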
I was able to get a custom output iterator to work with exclusive-scan so it could hold the final sum in `int64_t`.
This is much better and looks promising to support overall. I will work on a PR to add this to the `make_strings_children` utility.
That is really great to hear. Thanks so much for working on this.
…w in offsets (#12180)

Add a new iterator that can be used with scan functions to also return the last element value with a higher precision than the scan type. This is used in the `cudf::strings::detail::make_strings_children` utility to convert output string sizes into offsets. The iterator is used with `thrust::exclusive_scan` to compute an overall result that can be checked against the max of `size_type`. The iterator adds minimal overhead to save the last entry of the scan. An error is thrown if the reduction value exceeds the max of `size_type`.

A custom input iterator is not required, since `thrust::exclusive_scan` uses the init parameter type (set as 0) as the accumulator type for the scan. The values are passed to the output iterator with this type as well. The iterator then simply casts the output to the scan result type and saves the last value in a separate variable.

Closes #12087

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)

URL: #12180
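As an aside, the thrust behavior this fix relies on, namely that the init parameter's type drives the accumulator type, can be verified with a minimal standalone example (host vectors for brevity; this is illustration, not cudf code):

```cpp
#include <thrust/host_vector.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>

int main()
{
  // Five sizes of 2^30 each; any int32_t running sum would wrap after two.
  thrust::host_vector<int32_t> sizes(5, 1 << 30);
  thrust::host_vector<int64_t> offsets(5);
  // The int64_t init makes the accumulator 64-bit, so no wraparound occurs.
  thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(), int64_t{0});
  std::cout << offsets.back() << std::endl;  // 4 * 2^30 = 4294967296 > INT32_MAX
  return 0;
}
```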
Describe the bug
This is actually a generic issue with a lot of string operations. If I try to call strings::concatenate on string columns that would produce more data than can fit in a string column (2 GiB), I can get an overflow where cuDF tries to allocate a negative amount of memory. If I go even larger, over 4 GiB of data in the final output, the result overflows twice and we end up allocating memory but walking off the end of the data.
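To make the two failure modes concrete, here is a small standalone illustration (plain C++, not cudf code) of how a 2 to 4 GiB total wraps to a negative `int32_t`, while a total just over 4 GiB wraps back to a small positive value:

```cpp
#include <cstdint>
#include <iostream>

int main()
{
  int64_t const gib = int64_t{1} << 30;
  // 2.5 GiB: wraps once into a negative int32_t -> negative allocation size.
  // (Two's complement wrap; well-defined since C++20, universal in practice.)
  std::cout << static_cast<int32_t>(5 * gib / 2) << std::endl;  // -1610612736
  // 4.5 GiB: wraps past zero into a small positive int32_t -> a too-short
  // allocation that later writes walk off the end of.
  std::cout << static_cast<int32_t>(9 * gib / 2) << std::endl;  // 536870912
  return 0;
}
```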
I know we don't want to introspect the data for error checks. I have tried to do a prototype of this myself and I am no C++ expert, but I have heard that @davidwendt did something similar in another place. If we could do the exclusive scan as an `int64_t` instead of as an `int32_t`, then when writing the result to `d_offsets` we would cast it back to an `int32_t`. For the last offset, the one that we care about, we would also save it to another place as an `int64_t`. This would be the length of the character buffer. We could then do size checks on this and verify it will fit before we try to allocate it.

Even if it did require an extra kernel call in the common path, there is no way for a user to detect this type of overflow ahead of time. The alternative would result in everyone doing two new kernel calls: the first one would compute the lengths just like today, and another one would do a SUM on all of the values as a long to see if we would overflow. That is a huge overhead compared to making a hopefully small modification to the existing code. I also don't see how it would slow things down, because the `int64_t` would only exist within the GPU kernel; it would not be read or written, except for the very last offset.
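A minimal sketch of the cast-on-write portion of this proposal, using `thrust::make_transform_output_iterator`; saving the final offset as an `int64_t` in the same pass is the part that ultimately required the custom iterator from #12180, and is omitted here:

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_output_iterator.h>
#include <thrust/scan.h>

#include <cstdint>

// Narrow each int64_t scan result to int32_t as it is written to d_offsets.
struct cast_to_int32 {
  __host__ __device__ int32_t operator()(int64_t v) const
  {
    return static_cast<int32_t>(v);
  }
};

void compute_offsets(thrust::device_vector<int32_t> const& sizes,
                     thrust::device_vector<int32_t>& d_offsets)
{
  auto out = thrust::make_transform_output_iterator(d_offsets.begin(),
                                                    cast_to_int32{});
  // The int64_t init makes the accumulator 64-bit, so the running sum never
  // wraps inside the kernel; only the stored int32_t values are narrowed.
  thrust::exclusive_scan(sizes.begin(), sizes.end(), out, int64_t{0});
}
```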
Steps/Code to reproduce bug
Make a byte column that contains 10 in it. The length should be `max_value<int32_t> / 2 + 1`. Convert that byte column to a string and the result will overflow and try to allocate a negative value. If you want it to walk off the end of the string, use a short column with the value 1010 in it. This time, when converting to a string, it will write off the end of memory, and I often see "an illegal memory access was encountered".
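A hedged sketch of these steps against the public libcudf API (assuming `cudf::make_column_from_scalar` and `cudf::strings::from_integers`; actual behavior depends on available GPU memory and library version):

```cpp
#include <cudf/column/column_factories.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/convert/convert_integers.hpp>

#include <limits>

int main()
{
  // 2^30 rows; each row casts to the two-character string "10", so the total
  // character count is 2 * 2^30 = 2^31 > max_value<int32_t>.
  auto const num_rows = std::numeric_limits<int32_t>::max() / 2 + 1;
  cudf::numeric_scalar<int8_t> ten{10};
  auto bytes = cudf::make_column_from_scalar(ten, num_rows);
  // Without an overflow check, this attempts a negative-size allocation.
  auto strings = cudf::strings::from_integers(bytes->view());
  return 0;
}
```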
Expected behavior
We get an exception back saying the input is too large to allocate, instead of trying to allocate a negative size or, even worse, a really small positive size.