Allow constructing ByteViewArray from existing blocks #5796
/// Try to append a view of the given `block`, `offset` and `length`
///
/// See [`Self::append_block`]
pub fn try_append_view(&mut self, block: u32, offset: u32, len: u32) -> Result<(), ArrowError> {
I don't know what performance impact the validation logic here will have, but we can always add an unchecked version down the line should it become a problem.
It seems like we have a filter benchmark but not a raw array creation speed benchmark:
arrow-rs/arrow/src/util/bench_util.rs
Line 141 in 9828bf0
pub fn create_string_view_array_with_len(
I agree: let's start like this, then add benchmarks (like reading from Parquet), and if they show slowdowns we can add unchecked versions.
I think this looks like a good API to me
cc @ariesdevil
///
/// # Append Values
///
/// To avoid bump allocating this builder allocates data in fixed size blocks, configurable
- /// To avoid bump allocating this builder allocates data in fixed size blocks, configurable
+ /// To avoid bump allocating, this builder allocates data in fixed size blocks, configurable
👏 looks great to me. Nice work @tustvold
let mut v = StringViewBuilder::new();
assert_eq!(v.append_block(b1), 0);

v.append_value("This is a very long string that exceeds the inline length");
These values are appended to the current block (`0`), right?
]
);

let err = v.try_append_view(0, u32::MAX, 1).unwrap_err();
Can you please also add an error test for an invalid block ID? (aka "No block found with index {block}")
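For readers following the thread, the two failure modes mentioned here (an out-of-range offset and an unknown block index) can be sketched without any Arrow dependency. This is only an illustration of the kind of bounds check being discussed, not the arrow-rs implementation: `ViewError`, `check_view`, and the use of `Vec<u8>` for blocks are hypothetical names chosen for the sketch, and the real method returns `ArrowError`.

```rust
/// Hypothetical error type for this sketch (arrow-rs returns `ArrowError`).
#[derive(Debug, PartialEq)]
enum ViewError {
    /// No block exists with the given index.
    NoSuchBlock(u32),
    /// `offset..offset + len` does not fit inside the block.
    OutOfBounds { block: u32, offset: u32, len: u32 },
}

/// Checks that `offset..offset + len` lies inside `blocks[block]`.
/// Blocks are modelled as plain `Vec<u8>` here (an assumption of the sketch).
fn check_view(blocks: &[Vec<u8>], block: u32, offset: u32, len: u32) -> Result<(), ViewError> {
    let data = blocks
        .get(block as usize)
        .ok_or(ViewError::NoSuchBlock(block))?;
    // Widen to u64 so that `offset + len` cannot overflow.
    let end = offset as u64 + len as u64;
    if end > data.len() as u64 {
        return Err(ViewError::OutOfBounds { block, offset, len });
    }
    Ok(())
}
```

With a single 10-byte block, `check_view(&blocks, 0, 2, 8)` succeeds, while `(0, u32::MAX, 1)` and `(1, 0, 1)` hit the out-of-bounds and missing-block cases respectively.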
LGTM
Which issue does this PR close?
Relates to #5736
Relates to #5530
Relates to #5735
Rationale for this change
Whilst working on #5736 I struggled to devise a coherent interface for constructing byte views, because views can't really be constructed independently of the data buffers. In particular small strings need to be inlined in the view, but longer strings need to be added to a data buffer. As a result any interface that exposes the view abstraction is naturally leaky, and quite fiddly to use correctly.
Fortunately we already have a builder that abstracts away the view shenanigans, and with some minor tweaks we can extend it to allow using existing buffers. I think this provides a much more coherent abstraction for constructing byte view arrays.
I think we should still proceed with making the buffer views typed, i.e. #5736, but this PR simplifies that work to a read-focused abstraction.
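To illustrate why views cannot be built independently of the data buffers, here is a minimal, dependency-free sketch of the 16-byte view layout from the Arrow format: values of at most 12 bytes are stored inline after the length, while longer values store a 4-byte prefix plus a buffer index and offset. The function `make_view` and the constant `MAX_INLINE_LEN` are hypothetical names for this sketch and are not part of the arrow-rs API.

```rust
/// Values up to this many bytes are stored inline in the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Toy encoder (hypothetical helper, not the arrow-rs API).
/// Returns the 16-byte view as a little-endian `u128`.
/// `block` and `offset` are only used when `data` is too long to inline,
/// and the caller must already have placed `data` in that block.
fn make_view(data: &[u8], block: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    // Bytes 0..4: length of the value.
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        // Bytes 4..16: the value itself, zero padded.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Bytes 4..8: first four bytes of the value (prefix).
        bytes[4..8].copy_from_slice(&data[..4]);
        // Bytes 8..12: index of the data buffer (block).
        bytes[8..12].copy_from_slice(&block.to_le_bytes());
        // Bytes 12..16: offset of the value within that buffer.
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}
```

In the long-value branch the view is only meaningful if `block` and `offset` point at bytes that really exist in a data buffer, which is the coupling the builder hides from callers.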
What changes are included in this PR?
Are there any user-facing changes?