-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255) #4681
Validate ArrayData Buffer Alignment and Automatically Align IPC buffers (#4255) #4681
Conversation
} | ||
t => unreachable!("Data type {:?} either unsupported or not primitive", t), | ||
}; | ||
|
||
Ok(make_array(array_data)) | ||
} | ||
|
||
/// Checks if given `Buffer` is properly aligned with `T`. | ||
/// If not, copying the data and padded it for alignment. | ||
fn get_aligned_buffer<T>(buffer: &Buffer, length: usize) -> Buffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was originally added because aarch64 has stronger alignment requirements for i128 than the 64-bit alignment commonly provided by flatbuffers, now this is handled systematically for all types
arrow-data/src/data.rs
Outdated
|
||
if buffer.len() < min_buffer_size { | ||
return Err(ArrowError::InvalidArgumentError(format!( | ||
"Need at least {} bytes in buffers[{}] in array of type {:?}, but got {}", | ||
min_buffer_size, i, self.data_type, buffer.len() | ||
))); | ||
} | ||
|
||
if buffer.as_ptr().align_offset(*alignment) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now actually verify this as part of ArrayData, instead of panicking when you try to create an array 🎉
buffers: vec![BufferSpec::FixedWidth { byte_width }], | ||
buffers: vec![BufferSpec::FixedWidth { | ||
byte_width: mem::size_of::<T>(), | ||
alignment: mem::align_of::<T>(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One consequence of this formulation is the alignment is architecture specific, I suspect this is fine, but it potentially worth highlighting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well let's hope nobody mmap
s arrow data between different architectures w/o carefully thinking about alignment.
On that note: I wonder what the cost would be to heavily overalign buffers in general to to make them valid on all supported architectures. Like why not just align everything to 256bits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like why not just align everything to 256bits
We do, Rust doesn't 😄 And at least until the allocator API is stabilised there is no way to over-align a Vec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we are aligning all buffers to native sizes -- I thought the point was that the arrow spec said we should align buffers to 64 bits -- if that is the case, shouldn't this code reflect the spec, not the native size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because native alignment is what the arrow-rs implementation requires, for many types this is less than 64-bits. We don't, for example, care if the values buffer of a StringArray is aligned at all. On the flip side, some architectures, such as aarch64 require more than 64-bit alignment for i128
, and so we require that also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the documentation added in ff6af65 make the behavior in this PR much clearer.
Thank you
let array_data = unsafe { | ||
ArrayData::builder(DataType::Int32) | ||
.add_buffer(buf2) | ||
.build_unchecked() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to use build_unchecked here, as ArrayData will now complain about the alignment 🎉
arrow-data/src/data.rs
Outdated
|
||
if buffer.as_ptr().align_offset(*alignment) != 0 { | ||
return Err(ArrowError::InvalidArgumentError(format!( | ||
"Misaligned buffers[{i}] in array of type {:?}, expected {alignment}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful for debugging to include the actual alignment in addition to the expected value.
It appears something similar was done in arrow2 recently: jorgecarleitao/arrow2#1535 |
for (buffer, spec) in self.buffers.iter_mut().zip(&layout.buffers) { | ||
if let BufferSpec::FixedWidth { alignment, .. } = spec { | ||
if buffer.as_ptr().align_offset(*alignment) != 0 { | ||
*buffer = Buffer::from_slice_ref(buffer.as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we maybe log a debug message when this happens? Some users might find it confusing that the buffers are silently copied / re-aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't currently have a logging setup #1495
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do we consider to use a feature for automatic alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I read this PR, if a user is creating ArrayData
directly, they have to explicitly call align_buffer
in order to ensure the data is aligned and the documentation says it may copy the data.
If they are using the IPC reader then the reader may automatically align (by copying) the data, which seems better than erroring.
As a user, if I didn't want buffers realigned automatically, it would make probably want a setting on the IPC reader that said "error rather than silently fix" on unaligned data. That seems like something we could add as a follow on PR if someone wants that behavior
That is for FFI, which I don't intend to make automatic, as it would likely be very surprising for users. Instead the fact we provide an escape valve in the form of |
@@ -56,7 +56,9 @@ fn test_bad_number_of_buffers() { | |||
} | |||
|
|||
#[test] | |||
#[should_panic(expected = "integer overflow computing min buffer size")] | |||
#[should_panic( | |||
expected = "Need at least 18446744073709551615 bytes in buffers[0] in array of type Int64, but got 8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an allocation cannot be larger than isize, we might as well just saturate at usize::MAX, we're going to fail anyway
arrow-data/src/data.rs
Outdated
/// | ||
/// This can be useful for when interacting with data sent over IPC or FFI, that may | ||
/// not meet the minimum alignment requirements | ||
pub fn align_buffers(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so can we use this to address unaligned imported arrays issue like #4431?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. It seems that it might take longer for a fix at Java Arrow implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also please document clearly
- What minimum alignment requirements are meant (is it Rust? Is it Arrow?)
- What happens if we have "unaligned buffers"? (Rust errors / panics? Less performance?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks good to me. I am also going to run the tests on an ARM processor to verify it is ok. Given the choice between copying to realign and runtime error, I think copying to realign is a reasonable behavior change
As I mentioned previously, it may be useful to add a "error rather than copy mode" but we can do that if anyone else thinks that is valuable
I think the documentation could be more clear about what alignment we are referring to and what happens when data is not aligned.
But overall thank you @tustvold -- I think this is a significant step forward
for (buffer, spec) in self.buffers.iter_mut().zip(&layout.buffers) { | ||
if let BufferSpec::FixedWidth { alignment, .. } = spec { | ||
if buffer.as_ptr().align_offset(*alignment) != 0 { | ||
*buffer = Buffer::from_slice_ref(buffer.as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I read this PR, if a user is creating ArrayData
directly, they have to explicitly call align_buffer
in order to ensure the data is aligned and the documentation says it may copy the data.
If they are using the IPC reader then the reader may automatically align (by copying) the data, which seems better than erroring.
As a user, if I didn't want buffers realigned automatically, it would make probably want a setting on the IPC reader that said "error rather than silently fix" on unaligned data. That seems like something we could add as a follow on PR if someone wants that behavior
@@ -1629,7 +1656,7 @@ impl DataTypeLayout { | |||
#[derive(Debug, PartialEq, Eq)] | |||
pub enum BufferSpec { | |||
/// each element has a fixed width | |||
FixedWidth { byte_width: usize }, | |||
FixedWidth { byte_width: usize, alignment: usize }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should document here what alignment
represents (e.g. required pointer alignment for buffers of this type)
buffers: vec![BufferSpec::FixedWidth { byte_width }], | ||
buffers: vec![BufferSpec::FixedWidth { | ||
byte_width: mem::size_of::<T>(), | ||
alignment: mem::align_of::<T>(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we are aligning all buffers to native sizes -- I thought the point was that the arrow spec said we should align buffers to 64 bits -- if that is the case, shouldn't this code reflect the spec, not the native size?
arrow-data/src/data.rs
Outdated
/// | ||
/// This can be useful for when interacting with data sent over IPC or FFI, that may | ||
/// not meet the minimum alignment requirements | ||
pub fn align_buffers(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also please document clearly
- What minimum alignment requirements are meant (is it Rust? Is it Arrow?)
- What happens if we have "unaligned buffers"? (Rust errors / panics? Less performance?)
@@ -1704,4 +1648,40 @@ mod tests { | |||
let output_batch = roundtrip_ipc_stream(&input_batch); | |||
assert_eq!(input_batch, output_batch); | |||
} | |||
|
|||
#[test] | |||
fn test_unaligned() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran this test without the changes in the PR and it fails like this:
---- reader::tests::test_unaligned stdout ----
thread 'reader::tests::test_unaligned' panicked at 'Memory pointer is not aligned with the specified scalar type', /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/scalar.rs:125:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also verified this PR and all tests pass on an ARM processor ("Ampere Altra" on GCP) 👍
I've added a number of additional doc comments to hopefully clarify what alignment means in this context |
Which issue does this PR close?
Closes #4255
Rationale for this change
This will allow zero-copy IPC (#4254), whilst also acting as an escape valve for issues like apache/arrow#32276 where arrow-cpp produces unaligned buffers
What changes are included in this PR?
Are there any user-facing changes?