Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug(arrow-row): calling convert_raw function cause "offset overflow" panic #6112

Open
JasonLi-cn opened this issue Jul 25, 2024 · 6 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@JasonLi-cn
Copy link
Contributor

JasonLi-cn commented Jul 25, 2024

Describe the bug

  • Datafusion Table Info:
    Having a http_url column, which DataType is Utf8, and it has a lot of distinct values.

  • Datafusion SQL

select http_url from tab group http_url
  • Panic Info
image
  • Reason
    GroupValuesRows in Datafusion stores rows using Rows, and this bug may be triggered when calling emit function.
    let values_capacity = rows.iter().map(|row| decoded_len(row, options)).sum();
    let mut offsets = BufferBuilder::<I>::new(len + 1);
    offsets.append(I::zero());
    let mut values = MutableBuffer::new(values_capacity);
    for row in rows {
    let offset = decode_blocks(row, options, |b| values.extend_from_slice(b));
    *row = &row[offset..];
    offsets.append(I::from_usize(values.len()).expect("offset overflow"))
    }

In the extreme case, if I append two Utf8 values which size are large(len1 + len2 > i32::MAX) into Rows twice, and then call convert_rows, which should also trigger the bug. 🤔

To Reproduce

Expected behavior

Additional context

@JasonLi-cn JasonLi-cn added the bug label Jul 25, 2024
@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

In newer versions of DataFusion I would expect that grouping on a string column would not use row format, but instead usse the special GroupValuesBytes:

https://github.com/apache/datafusion/blob/7db4213b71ed9e914c5a4f16954abfa20b091ae3/datafusion/physical-plan/src/aggregates/group_values/bytes.rs#L27

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔

@tustvold
Copy link
Contributor

tustvold commented Jul 25, 2024

FWIW in general offset overflows do yield panics in arrow-rs, the additional plumbing for error handling what is almost always an unrecoverable error has been hard to justify, although I suspect in this case it could be made into an ArrowError without it being a breaking change.

Edit: I've updated this to be an enhancement, panics are not a bug

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Jul 25, 2024
@JasonLi-cn
Copy link
Contributor Author

The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔

Yes, this problem was first discovered in the case of a group by multi-column.

@JasonLi-cn
Copy link
Contributor Author

FWIW in general offset overflows do yield panics in arrow-rs, the additional plumbing for error handling what is almost always an unrecoverable error has been hard to justify, although I suspect in this case it could be made into an ArrowError without it being a breaking change.

Edit: I've updated this to be an enhancement, panics are not a bug

Do you mean we need to add a new function like the following?

pub fn try_decode_binary<I: OffsetSizeTrait>(
    rows: &mut [&[u8]],
    options: SortOptions,
) -> Result<GenericBinaryArray<I>, ArrowError> {
    ...
}

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Do you mean we need to add a new function like the following?

Maybe you could add a check in your code (or in datafusion 🤔 ) on the size of the string buffer and make a new record batch if they exceed 2GB or something. This might be related: apache/datafusion#9562

Making a single array with more than 2GB of string data is likely to be non ideal in a bunch of ways

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants