Allow constructing ByteViewArray from existing blocks #5796

Merged (4 commits) on May 29, 2024

@tustvold tustvold commented May 23, 2024

Which issue does this PR close?

Relates to #5736
Relates to #5530
Relates to #5735

Rationale for this change

Whilst working on #5736 I struggled to devise a coherent interface for constructing byte views, because views can't really be constructed independently of the data buffers. In particular, small strings need to be inlined in the view, whereas longer strings need to be added to a data buffer. As a result, any interface that exposes the view abstraction is naturally leaky and quite fiddly to use correctly.

Fortunately, we already have a builder that abstracts away the view shenanigans, and with some minor tweaks we can extend it to allow using existing buffers. I think this provides a much more coherent abstraction for constructing byte view arrays.

I think we should still proceed with making the buffer views typed, i.e. #5736, but this PR lets that work be simplified to a read-focused abstraction.
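To illustrate why a view can't be built without access to the data buffers, here is a minimal sketch of the encoding, assuming the Arrow view layout (a 16-byte little-endian view, with values of at most 12 bytes stored inline). `make_view`, `MAX_INLINE_LEN` and the plain `Vec<u8>` buffer are hypothetical names for illustration only, not arrow-rs APIs.

```rust
/// Values of at most this many bytes are stored inline in the view
/// (assumption: the 12-byte inline limit of the Arrow view layout).
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: encode `value` as a 16-byte view.
///
/// Short values are copied into the view itself. Longer values are appended
/// to `buffer`, and the view records a 4-byte prefix, the index of the
/// buffer, and the offset of the value within it.
fn make_view(value: &[u8], buffer_index: u32, buffer: &mut Vec<u8>) -> u128 {
    let mut bytes = [0u8; 16];
    // Bytes 0..4 always hold the length of the value.
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= MAX_INLINE_LEN {
        // Inline case: the data lives in the view, zero padded.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Out-of-line case: the view cannot be built without touching the buffer.
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(value);
        bytes[4..8].copy_from_slice(&value[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}
```

The two branches are what makes a standalone "view constructor" leaky: the caller has to know whether a value was inlined or written to a buffer, which is the bookkeeping the builder hides.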

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 23, 2024
/// Try to append a view of the given `block`, `offset` and `length`
///
/// See [`Self::append_block`]
pub fn try_append_view(&mut self, block: u32, offset: u32, len: u32) -> Result<(), ArrowError> {
Contributor Author

I don't know what performance impact the validation logic here will have, but we can always add an unchecked version down the line should it become a problem.

Contributor

It seems like we have a filter benchmark but not a raw array creation speed benchmark:

pub fn create_string_view_array_with_len(

I agree, let's start like this and then add benchmarks (like reading from parquet). If they show slowdowns, we can add unchecked versions.
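For readers wondering what the validation being discussed amounts to, here is a rough sketch of the kind of checks involved. It is not the arrow-rs implementation: `ViewError` and `check_view` are hypothetical, and blocks are modelled as plain `Vec<u8>` values, whereas the PR returns `ArrowError` and works on Arrow buffers.

```rust
/// Hypothetical error type for this sketch (the PR itself returns `ArrowError`).
#[derive(Debug, PartialEq)]
enum ViewError {
    /// No block exists with the given index.
    NoSuchBlock(u32),
    /// The range `offset..offset + len` falls outside the block.
    OutOfBounds { block: u32, offset: u32, len: u32 },
}

/// Hypothetical check: does `offset..offset + len` lie within block `block`?
fn check_view(blocks: &[Vec<u8>], block: u32, offset: u32, len: u32) -> Result<(), ViewError> {
    let data = blocks
        .get(block as usize)
        .ok_or(ViewError::NoSuchBlock(block))?;
    // Widen to u64 so that `offset + len` cannot overflow.
    let end = offset as u64 + len as u64;
    if end > data.len() as u64 {
        return Err(ViewError::OutOfBounds { block, offset, len });
    }
    Ok(())
}
```

Both checks are a bounds comparison per appended view, so the cost is likely to be small, but only a benchmark of array creation would confirm that.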

@tustvold tustvold marked this pull request as draft May 23, 2024 11:37
@alamb alamb left a comment

I think this looks like a good API to me

cc @ariesdevil

///
/// # Append Values
///
/// To avoid bump allocating this builder allocates data in fixed size blocks, configurable
Contributor

Suggested change
- /// To avoid bump allocating this builder allocates data in fixed size blocks, configurable
+ /// To avoid bump allocating, this builder allocates data in fixed size blocks, configurable

@tustvold tustvold marked this pull request as ready for review May 29, 2024 11:16
@alamb alamb left a comment

👏 looks great to me. Nice work @tustvold

let mut v = StringViewBuilder::new();
assert_eq!(v.append_block(b1), 0);

v.append_value("This is a very long string that exceeds the inline length");
Contributor

These values are appended to the current block (0), right?

]
);

let err = v.try_append_view(0, u32::MAX, 1).unwrap_err();
Contributor

Can you please also add an error test for an invalid block ID? (aka "No block found with index {block}")

@ariesdevil ariesdevil left a comment

LGTM

@tustvold tustvold merged commit 1634a65 into apache:master May 29, 2024
24 of 25 checks passed