Improve arrow-row --> StringView/BinaryView memory usage #6057

Open
Tracked by #6163 ...
alamb opened this issue Jul 15, 2024 · 0 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Contributor

alamb commented Jul 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5374

@XiangpengHao implemented optimized row format --> ByteView (StringView / BinaryView) encoding/decoding in #5945 / #6044

It also adds benchmarks so we can test 🎉

However, as mentioned in https://github.com/apache/arrow-rs/pull/6044/files#r1676803119, the output array in #6044 stores both short and long strings in its data buffer, even though the views only reference the long strings (the short strings are included only to make utf8 validation fast).

As a result, the output array uses more memory than necessary.
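To make the overhead concrete, here is a minimal std-only sketch (not arrow-rs code) that compares the data buffer size when every value is copied against the size when only values longer than 12 bytes are copied. The function name `data_buffer_sizes` and the constant `MAX_INLINE_LEN` are hypothetical names introduced for illustration; the 12-byte inline limit matches the `bytes.len() > 12` check in the snippet quoted in this issue.

```rust
/// Values of at most this many bytes fit inline in a view (assumed from the
/// `bytes.len() > 12` check quoted in this issue).
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: returns `(all, long_only)` where
/// - `all` is the data buffer size if every value is copied into it, and
/// - `long_only` is the size if only non-inlinable values are copied.
fn data_buffer_sizes(values: &[&[u8]]) -> (usize, usize) {
    let all: usize = values.iter().map(|v| v.len()).sum();
    let long_only: usize = values
        .iter()
        .filter(|v| v.len() > MAX_INLINE_LEN)
        .map(|v| v.len())
        .sum();
    (all, long_only)
}
```

The difference `all - long_only` is the memory spent on short strings that no view points at; for inputs dominated by short strings it can be most of the buffer.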

Describe the solution you'd like

Reduce the memory required by the output array.

Describe alternatives you've considered
One idea is to use a separate utf8 validation buffer for short strings, similar to this code:

    let read = if !self.validate_utf8 {
        self.decoder.read(len, |bytes| {
            let offset = array_buffer.len();
            let view = make_view(bytes, buffer_id, offset as u32);
            if bytes.len() > 12 {
                // only copy the data to buffer if the string can not be inlined.
                array_buffer.extend_from_slice(bytes);
            }
            // # Safety
            // The buffer_id is the last buffer in the output buffers
            // The offset is calculated from the buffer, so it is valid
            unsafe {
                output.append_raw_view_unchecked(&view);
            }
            Ok(())
        })?
    } else {
        // utf8 validation buffer has only short strings. These short
        // strings are inlined into the views but we copy them into a
        // contiguous buffer to accelerate validation.
        let mut utf8_validation_buffer = Vec::with_capacity(4096);
        let v = self.decoder.read(len, |bytes| {
            let offset = array_buffer.len();
            let view = make_view(bytes, buffer_id, offset as u32);
            if bytes.len() > 12 {
                // only copy the data to buffer if the string can not be inlined.
                array_buffer.extend_from_slice(bytes);
            } else {
                utf8_validation_buffer.extend_from_slice(bytes);
            }
            // # Safety
            // The buffer_id is the last buffer in the output buffers
            // The offset is calculated from the buffer, so it is valid
            // Utf-8 validation is done later
            unsafe {
                output.append_raw_view_unchecked(&view);
            }
            Ok(())
        })?;
        check_valid_utf8(&array_buffer)?;
        check_valid_utf8(&utf8_validation_buffer)?;
        v
    };
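The same idea can be sketched without any arrow-rs types, which may help when reasoning about how it would apply to the row decoder. Everything here is hypothetical and simplified: `View`, `DecodeError`, `decode`, and this two-argument `check_valid_utf8` are stand-ins, not the arrow-rs API. Long values go into the output data buffer, short values go into a temporary scratch buffer that is dropped after validation, and each buffer is validated in one pass. The sketch also checks that every value starts on a character boundary, on the assumption that a multi-byte character split across two values should be rejected.

```rust
use std::str::from_utf8;

/// Values of at most this many bytes are inlined in the view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical, simplified stand-in for a 16-byte view.
#[derive(Debug, PartialEq)]
enum View {
    /// Short value stored directly in the view.
    Inline(Vec<u8>),
    /// Long value stored in the data buffer at `offset`.
    Buffer { offset: usize, len: usize },
}

#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidUtf8,
}

/// Validates `buf` in a single pass, then checks that each value start is a
/// char boundary, so every individual value is valid utf8 on its own.
fn check_valid_utf8(buf: &[u8], starts: &[usize]) -> Result<(), DecodeError> {
    let s = from_utf8(buf).map_err(|_| DecodeError::InvalidUtf8)?;
    if starts.iter().all(|&o| s.is_char_boundary(o)) {
        Ok(())
    } else {
        Err(DecodeError::InvalidUtf8)
    }
}

/// Builds views plus a data buffer that holds only the long values.
fn decode(values: &[&[u8]]) -> Result<(Vec<View>, Vec<u8>), DecodeError> {
    let mut views = Vec::with_capacity(values.len());
    let mut data = Vec::new(); // kept: referenced by the views
    let mut scratch = Vec::new(); // temporary: short values, validation only
    let mut data_starts = Vec::new();
    let mut scratch_starts = Vec::new();
    for &bytes in values {
        if bytes.len() > MAX_INLINE_LEN {
            data_starts.push(data.len());
            views.push(View::Buffer { offset: data.len(), len: bytes.len() });
            data.extend_from_slice(bytes);
        } else {
            scratch_starts.push(scratch.len());
            scratch.extend_from_slice(bytes);
            views.push(View::Inline(bytes.to_vec()));
        }
    }
    check_valid_utf8(&data, &data_starts)?;
    check_valid_utf8(&scratch, &scratch_starts)?;
    Ok((views, data)) // `scratch` is dropped here
}
```

The trade-off in this sketch is a short-lived extra allocation for `scratch` in exchange for an output buffer that contains only bytes the views actually reference.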

Additional context
