Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255) #4681

Merged
merged 6 commits into from
Aug 16, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #4255

Rationale for this change

This will allow zero-copy IPC (#4254), whilst also acting as an escape valve for issues like apache/arrow#32276 where arrow-cpp produces unaligned buffers

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold added the api-change Changes to the arrow API label Aug 11, 2023
@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 11, 2023
}
t => unreachable!("Data type {:?} either unsupported or not primitive", t),
};

Ok(make_array(array_data))
}

/// Checks if given `Buffer` is properly aligned with `T`.
/// If not, copying the data and padded it for alignment.
fn get_aligned_buffer<T>(buffer: &Buffer, length: usize) -> Buffer {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally added because aarch64 has stronger alignment requirements for i128 than the 64-bit alignment commonly provided by flatbuffers, now this is handled systematically for all types


if buffer.len() < min_buffer_size {
return Err(ArrowError::InvalidArgumentError(format!(
"Need at least {} bytes in buffers[{}] in array of type {:?}, but got {}",
min_buffer_size, i, self.data_type, buffer.len()
)));
}

if buffer.as_ptr().align_offset(*alignment) != 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now actually verify this as part of ArrayData, instead of panicking when you try to create an array 🎉

buffers: vec![BufferSpec::FixedWidth { byte_width }],
buffers: vec![BufferSpec::FixedWidth {
byte_width: mem::size_of::<T>(),
alignment: mem::align_of::<T>(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One consequence of this formulation is the alignment is architecture specific, I suspect this is fine, but it potentially worth highlighting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well let's hope nobody mmaps arrow data between different architectures w/o carefully thinking about alignment.

On that note: I wonder what the cost would be to heavily overalign buffers in general to to make them valid on all supported architectures. Like why not just align everything to 256bits?

Copy link
Contributor Author

@tustvold tustvold Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like why not just align everything to 256bits

We do, Rust doesn't 😄 And at least until the allocator API is stabilised there is no way to over-align a Vec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are aligning all buffers to native sizes -- I thought the point was that the arrow spec said we should align buffers to 64 bits -- if that is the case, shouldn't this code reflect the spec, not the native size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because native alignment is what the arrow-rs implementation requires, for many types this is less than 64-bits. We don't, for example, care if the values buffer of a StringArray is aligned at all. On the flip side, some architectures, such as aarch64 require more than 64-bit alignment for i128, and so we require that also

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the documentation added in ff6af65 make the behavior in this PR much clearer.

Thank you

let array_data = unsafe {
ArrayData::builder(DataType::Int32)
.add_buffer(buf2)
.build_unchecked()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to use build_unchecked here, as ArrayData will now complain about the alignment 🎉

@tustvold tustvold changed the title Automatically align misaligned IPC buffers (#4255) Validate ArrayData Bufer Alignment and Automatically Align IPC buffers (#4255) Aug 11, 2023
@tustvold tustvold changed the title Validate ArrayData Bufer Alignment and Automatically Align IPC buffers (#4255) Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255) Aug 11, 2023

if buffer.as_ptr().align_offset(*alignment) != 0 {
return Err(ArrowError::InvalidArgumentError(format!(
"Misaligned buffers[{i}] in array of type {:?}, expected {alignment}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful for debugging to include the actual alignment in addition to the expected value.

@alamb
Copy link
Contributor

alamb commented Aug 11, 2023

It appears something similar was done in arrow2 recently: jorgecarleitao/arrow2#1535

for (buffer, spec) in self.buffers.iter_mut().zip(&layout.buffers) {
if let BufferSpec::FixedWidth { alignment, .. } = spec {
if buffer.as_ptr().align_offset(*alignment) != 0 {
*buffer = Buffer::from_slice_ref(buffer.as_ref())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we maybe log a debug message when this happens? Some users might find it confusing that the buffers are silently copied / re-aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently have a logging setup #1495

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do we consider to use a feature for automatic alignment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read this PR, if a user is creating ArrayData directly, they have to explicitly call align_buffer in order to ensure the data is aligned and the documentation says it may copy the data.

If they are using the IPC reader then the reader may automatically align (by copying) the data, which seems better than erroring.

As a user, if I didn't want buffers realigned automatically, it would make probably want a setting on the IPC reader that said "error rather than silently fix" on unaligned data. That seems like something we could add as a follow on PR if someone wants that behavior

@tustvold
Copy link
Contributor Author

t appears something similar was done in arrow2 recently: jorgecarleitao/arrow2#1535

That is for FFI, which I don't intend to make automatic, as it would likely be very surprising for users. Instead the fact we provide an escape valve in the form of ArrayData::align_buffers I think is sufficient

@@ -56,7 +56,9 @@ fn test_bad_number_of_buffers() {
}

#[test]
#[should_panic(expected = "integer overflow computing min buffer size")]
#[should_panic(
expected = "Need at least 18446744073709551615 bytes in buffers[0] in array of type Int64, but got 8"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an allocation cannot be larger than isize, we might as well just saturate at usize::MAX, we're going to fail anyway

///
/// This can be useful for when interacting with data sent over IPC or FFI, that may
/// not meet the minimum alignment requirements
pub fn align_buffers(&mut self) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so can we use this to address unaligned imported arrays issue like #4431?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It seems that it might take longer for a fix at Java Arrow implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also please document clearly

  1. What minimum alignment requirements are meant (is it Rust? Is it Arrow?)
  2. What happens if we have "unaligned buffers"? (Rust errors / panics? Less performance?)

@tustvold tustvold requested a review from alamb August 15, 2023 07:49
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me. I am also going to run the tests on an ARM processor to verify it is ok. Given the choice between copying to realign and runtime error, I think copying to realign is a reasonable behavior change

As I mentioned previously, it may be useful to add a "error rather than copy mode" but we can do that if anyone else thinks that is valuable

I think the documentation could be more clear about what alignment we are referring to and what happens when data is not aligned.

But overall thank you @tustvold -- I think this is a significant step forward

for (buffer, spec) in self.buffers.iter_mut().zip(&layout.buffers) {
if let BufferSpec::FixedWidth { alignment, .. } = spec {
if buffer.as_ptr().align_offset(*alignment) != 0 {
*buffer = Buffer::from_slice_ref(buffer.as_ref())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read this PR, if a user is creating ArrayData directly, they have to explicitly call align_buffer in order to ensure the data is aligned and the documentation says it may copy the data.

If they are using the IPC reader then the reader may automatically align (by copying) the data, which seems better than erroring.

As a user, if I didn't want buffers realigned automatically, it would make probably want a setting on the IPC reader that said "error rather than silently fix" on unaligned data. That seems like something we could add as a follow on PR if someone wants that behavior

@@ -1629,7 +1656,7 @@ impl DataTypeLayout {
#[derive(Debug, PartialEq, Eq)]
pub enum BufferSpec {
/// each element has a fixed width
FixedWidth { byte_width: usize },
FixedWidth { byte_width: usize, alignment: usize },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document here what alignment represents (e.g. required pointer alignment for buffers of this type)

buffers: vec![BufferSpec::FixedWidth { byte_width }],
buffers: vec![BufferSpec::FixedWidth {
byte_width: mem::size_of::<T>(),
alignment: mem::align_of::<T>(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are aligning all buffers to native sizes -- I thought the point was that the arrow spec said we should align buffers to 64 bits -- if that is the case, shouldn't this code reflect the spec, not the native size?

///
/// This can be useful for when interacting with data sent over IPC or FFI, that may
/// not meet the minimum alignment requirements
pub fn align_buffers(&mut self) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also please document clearly

  1. What minimum alignment requirements are meant (is it Rust? Is it Arrow?)
  2. What happens if we have "unaligned buffers"? (Rust errors / panics? Less performance?)

@@ -1704,4 +1648,40 @@ mod tests {
let output_batch = roundtrip_ipc_stream(&input_batch);
assert_eq!(input_batch, output_batch);
}

#[test]
fn test_unaligned() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this test without the changes in the PR and it fails like this:

---- reader::tests::test_unaligned stdout ----
thread 'reader::tests::test_unaligned' panicked at 'Memory pointer is not aligned with the specified scalar type', /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/scalar.rs:125:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also verified this PR and all tests pass on an ARM processor ("Ampere Altra" on GCP) 👍

@tustvold
Copy link
Contributor Author

I've added a number of additional doc comments to hopefully clarify what alignment means in this context

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle Misaligned IPC Buffers
4 participants