
[BUG] strings::concatenate can overflow and cause data corruption #12087

Closed
revans2 opened this issue Nov 7, 2022 · 17 comments · Fixed by #12180
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Nov 7, 2022

Describe the bug
This is actually a generic issue with a lot of string operations. If I call strings::concatenate on string columns that would produce more data than can fit in a string column (2 GiB), I can get an overflow where cuDF tries to allocate a negative amount of memory. If the final output would be even larger, over 4 GiB, the result overflows twice: we end up allocating a too-small buffer and walking off the end of the data.

I know we don't want to introspect the data for error checks. I have tried to prototype this myself and I am no C++ expert, but I have heard that @davidwendt did something similar in another place. The idea: do the exclusive scan as an int64_t instead of an int32_t, and cast each result back to an int32_t when writing it to d_offsets. The last offset, the one we care about, would also be saved somewhere else as an int64_t; this is the length of the character buffer. We could then do size checks on it and verify it will fit before we try to allocate it.

Even if it did require an extra kernel call in the common path, there is no way for a user to detect this type of overflow ahead of time. Doing the check externally would mean two new kernel calls for everyone: one to compute the lengths just like today, and another to do a SUM over all the values as a long to see whether we would overflow. That is huge overhead compared to a hopefully small modification to the existing code. I also don't see how my suggestion would slow things down, because the int64_t would only exist within the GPU kernel; it would not be read or written, except for the very last offset.

Steps/Code to reproduce bug
Make a byte column in which every value is 10. The length should be max_value<int32_t> / 2 + 1. Convert that byte column to a string column: the result will overflow and try to allocate a negative value. If you want it to walk off the end of the buffer, use a short column with the value 1010 in it instead. This time, when converting to a string, it will write off the end of memory, and I often see "an illegal memory access was encountered".

Expected behavior
We get an exception back saying the input is too large to allocate, instead of an attempt to allocate a negative size or, even worse, a really small positive one.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Nov 7, 2022
@bdice
Contributor

bdice commented Nov 7, 2022

edit: I accidentally conflated the two types of column-wise/string-wise concatenation. See below.

The first bug I would fix is that we should raise an error if the concatenated size will exceed the limit. This is what cudf::concatenate does here:

CUDF_EXPECTS(output_size <= static_cast<std::size_t>(std::numeric_limits<size_type>::max()),
             "Total number of concatenated rows exceeds size_type range");

I wonder if this can be detected in advance without requiring a kernel call? Would a sum over the chars_size() of each column be sufficient? @davidwendt

size_type strings_column_view::chars_size() const noexcept
{
  if (size() == 0) return 0;
  return chars().size();
}

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

There would be some hurdles along the way; it is not as simple as summing up chars_size(). For example, concatenating strings with nulls results in a null, so summing chars_size() may say "overflow INT_MAX" when the actual result is just zero.

@davidwendt
Contributor

The first bug I would fix is that we should raise an error if the concatenated size will exceed the limit. This is what cudf::concatenate does here:

This is not the same concatenate. This is a horizontal concatenate of row-wise elements. There is no computation of output_size here, and adding one would require an extra kernel call.

The chars_size() would be a good estimate but may not be accurate for sliced columns. It is a good upper bound, though.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

concatenating strings with nulls will result in a null thus summing up chars_size() may say "overflow INT_MAX" but the actual result may be just zero.

I just realized that the check @bdice pointed out above actually doesn't consider nulls, so it is very conservative and not accurate. But that's fine.

@revans2
Contributor Author

revans2 commented Nov 7, 2022

To be clear, this shows up in a number of places, almost all of which could in theory overflow. strings::concatenate is one in particular that is especially problematic. But it also happens when casting a number to a string, and when lower-casing or upper-casing a string, as that can change the number of bytes without changing the number of characters. Most of the places where this could happen use make_strings_children in include/cudf/strings/detail/utilities.cuh, so if we make the change I suggested when calculating the offsets, we fix lots of potential problems all at once.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

I think the main reason people are reluctant to add such a safety check is the performance impact. We could alleviate that by adding an optional bool parameter like bound_check, so the check only executes if the user explicitly asks for it; otherwise the old fast code path runs.

@jrhemstad
Contributor

jrhemstad commented Nov 7, 2022

So we may alleviate that impact by adding an optional bool parameter like bound_check so the bound check will only execute if the user explicitly wants it.

Definitely not this. We've gone down this path before and we learned the hard way that this was a mistake. Not something we want to repeat.

I discussed this at length in #5505

@jrhemstad
Contributor

@revans2 correct me if I'm wrong, but this problem isn't fundamentally different than the problem we had with join APIs that led us to having the join APIs return the gather map in addition to adding the row_bit_count API, right?

In other words, you don't know the size of the join ahead of time and it could overflow. So we return the gather map and you can gather from the row_bit_count results and get how big the resulting column would be.

We'd need an equivalent solution, but for basically every string API.

This would be a lot of work, but one solution might be to explicitly and publicly expose the fact that every string API is implemented in two phases, i.e., phase 1 computes and returns the size of every output string, phase 2 materializes the final output string.

If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.

Please remember that we have a lot of string APIs 😏

So this approach would probably require modifying the API interface to be somewhat similar to the cub APIs, instead of having two separate interfaces for each one: a single interface per API whose output (phase 1 or phase 2) depends on the input.

@revans2
Contributor Author

revans2 commented Nov 7, 2022

If phase 1 was exposed to the user, then they could check for themselves if the output size would overflow.

We would have to expose the calculation of the length of each output string in bytes, not the offsets, because the overflow happens when calculating the offsets. That would work, but it would also preclude what happens today, where the offsets buffer is reused as intermediate storage. So at a minimum we would need to allocate a buffer for the sizes and another buffer for the offsets in a separate call.

But is there a reason why we cannot do my proposal? Do the exclusive_scan with a transform_iterator converting d_offsets to int64_t, and an output iterator that writes the results back to d_offsets as int32_t but also writes out the last value as an int64_t so we can check whether it overflowed. There would be no extra kernel calls and no extra data transfers. It should just work and be as fast as what we have today. I am not a thrust expert, so there could be something I am missing, as I have not made it work yet. But it feels like we can get the check without any extra cost.

@ttnghia
Contributor

ttnghia commented Nov 7, 2022

But is there a reason why we cannot do my proposal? Do the exclusive_scan with a transform_iterator converting the d_offsets to an int64_t and an output_iterator that would write the results out as an int32_t to d_offsets, but also for the last value would write it out as an int64_t so we could check if it overflowed. There would be no extra kernel calls, no extra data transfers.

That's also what I did for strings::repeat_strings (#8561), because overflow is too likely for that API and checking for it was unavoidable. In addition, the "phase-1" output of that API was made public, exactly the approach @jrhemstad mentioned.

@davidwendt davidwendt self-assigned this Nov 7, 2022
@davidwendt
Contributor

I'm going to try Bobby's idea and run some benchmarks.

@davidwendt
Contributor

I was not able to get thrust::exclusive_scan to accept a custom iterator.
Here is the simplest one I tried: https://godbolt.org/z/jM51nWqf7
Here is a more complex one which fails from cub: https://godbolt.org/z/dz16T6bbE

@davidwendt
Contributor

I tried an alternate approach which does an atomicAdd on the sizes as they are computed in the first pass kernel.
Here are the benchmark results:

Benchmark                                          Time         CPU  Time Old  Time New  CPU Old  CPU New
---------------------------------------------------------------------------------------------------------
StringCombine/concat/4096/16/manual_time        +0.1062     +0.0890         0         0        0        0
StringCombine/concat/4096/64/manual_time        +0.0811     +0.0713         0         0        0        0
StringCombine/concat/4096/256/manual_time       +0.0269     +0.0257         0         0        0        0
StringCombine/concat/4096/1024/manual_time      +0.0067     +0.0068         1         1        1        1
StringCombine/concat/32768/16/manual_time       +0.3137     +0.2645         0         0        0        0
StringCombine/concat/32768/64/manual_time       +0.1709     +0.1561         0         0        0        0
StringCombine/concat/32768/256/manual_time      +0.0558     +0.0542         0         1        1        1
StringCombine/concat/32768/1024/manual_time     +0.0238     +0.0237         2         2        2        2
StringCombine/concat/262144/16/manual_time      +1.1988     +1.0865         0         0        0        0
StringCombine/concat/262144/64/manual_time      +0.3000     +0.2922         1         1        1        1
StringCombine/concat/262144/256/manual_time     +0.0080     +0.0081        22        22       22       22
StringCombine/concat/262144/1024/manual_time    +0.0017     +0.0018       120       120      120      120
StringCombine/concat/2097152/16/manual_time     +2.0618     +2.0117         1         2        1        2
StringCombine/concat/2097152/64/manual_time     +0.3325     +0.3316         4         6        4        6
StringCombine/concat/2097152/256/manual_time    +0.0060     +0.0062       195       196      195      196
StringCombine/concat/16777216/16/manual_time    +2.2769     +2.2708         5        17        5       17
StringCombine/concat/16777216/64/manual_time    +0.3475     +0.3478        34        45       34       45

@revans2
Contributor Author

revans2 commented Nov 10, 2022

For me the 10 ms hit for 17 million strings is worth it. But I can see how others might not agree.

@davidwendt
Contributor

davidwendt commented Nov 14, 2022

I was able to get a custom output iterator to work with exclusive-scan so it could hold the final sum in std::size_t precision while still supporting size_type for the offsets output. Here are the benchmark results for cudf::strings::concatenate with this implementation:

Benchmark                                          Time       CPU  Time Old  Time New  CPU Old  CPU New
--------------------------------------------------------------------------------------------------------
StringCombine/concat/4096/16/manual_time        +0.0335   +0.0304         0         0        0        0
StringCombine/concat/4096/64/manual_time        +0.0172   +0.0168         0         0        0        0
StringCombine/concat/4096/256/manual_time       +0.0209   +0.0205         0         0        0        0
StringCombine/concat/4096/1024/manual_time      +0.0133   +0.0133         1         1        1        1
StringCombine/concat/32768/16/manual_time       +0.0300   +0.0267         0         0        0        0
StringCombine/concat/32768/64/manual_time       +0.0234   +0.0223         0         0        0        0
StringCombine/concat/32768/256/manual_time      +0.0046   +0.0046         0         0        0        0
StringCombine/concat/32768/1024/manual_time     -0.0083   -0.0080         1         1        1        1
StringCombine/concat/262144/16/manual_time      +0.0259   +0.0230         0         0        0        0
StringCombine/concat/262144/64/manual_time      +0.0064   +0.0064         1         1        1        1
StringCombine/concat/262144/256/manual_time     -0.0145   -0.0147        14        14       14       14
StringCombine/concat/262144/1024/manual_time    +0.0396   +0.0396        77        80       77       80
StringCombine/concat/2097152/16/manual_time     +0.0101   +0.0101         1         1        1        1
StringCombine/concat/2097152/64/manual_time     +0.0039   +0.0040         8         8        8        8
StringCombine/concat/2097152/256/manual_time    -0.0014   -0.0014       119       119      119      119
StringCombine/concat/16777216/16/manual_time    +0.0101   +0.0103         4         4        4        4
StringCombine/concat/16777216/64/manual_time    -0.0008   -0.0010        58        58       58       58

This is much better and looks promising to support overall.
(Note the baseline numbers are different than the previous comment because these runs used a different GPU)

I will work on a PR to add this to the make_strings_children utility used by concatenate and check its impact on other APIs. This will hopefully allow throwing an error for the potential overflow with minimal impact to performance.

@revans2
Contributor Author

revans2 commented Nov 15, 2022

That is really great to hear. Thanks so much for working on this.

rapids-bot bot pushed a commit that referenced this issue Dec 8, 2022
…w in offsets (#12180)

Add a new iterator that can be used with scan functions to also return the last element value at a higher precision than the scan type. This is used in the `cudf::strings::detail::make_strings_children` utility when converting output string sizes into offsets. The iterator is used with `thrust::exclusive_scan` to compute an overall result that can be checked against the maximum of `size_type`; an error is thrown if the reduction value exceeds that maximum. The iterator adds minimal overhead to save the last entry of the scan.
A custom input iterator is not required, since `thrust::exclusive_scan` uses the init parameter's type (set as 0) as the accumulator type for the scan. Values are passed to the output iterator with this type as well; the iterator simply casts the output to the scan result type and saves the last value in a separate variable.

Closes #12087

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #12180