-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5357: [Rust] Change Buffer::len to represent total bytes instead of used bytes #4331
Conversation
After discussing with @paddyhoran on the JIRA, I'm now inclined to keep both @nevi-me let me know if this sounds good to you. |
I saw the conversation, would we be able to use capacity when writing to IPC from the current |
Yes, both reading and writing to IPC buffer will use |
Apologies for not responding sooner @sunchao It's only mutable buffer that has capacity, and when we freeze it to create a buffer, we 'lose' the capacity. #[derive(PartialEq, Debug)]
pub struct Buffer {
/// Reference-counted pointer to the internal byte buffer.
data: Arc<BufferData>,
/// The offset into the buffer.
offset: usize,
}
struct BufferData {
ptr: *const u8,
len: usize,
// no capacity
} |
@nevi-me yes that's right. As I mentioned above, I think it's better to keep both I'm currently working on https://issues.apache.org/jira/browse/ARROW-5358 which hopefully will allow us to compare |
ed180da
to
85fe336
Compare
Codecov Report
@@ Coverage Diff @@
## master #4331 +/- ##
===========================================
- Coverage 87.5% 82.57% -4.93%
===========================================
Files 998 86 -912
Lines 141784 24935 -116849
Branches 1418 0 -1418
===========================================
- Hits 124065 20591 -103474
+ Misses 17357 4344 -13013
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
I think it is ready for review. It'd be great if you can take a look. |
Yep, sounds good. I'll take a look when I get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a few nits.
} | ||
|
||
impl PartialEq for BufferData { | ||
fn eq(&self, other: &BufferData) -> bool { | ||
if self.len != other.len { | ||
return false; | ||
} | ||
if self.capacity != other.capacity { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against the current implementation, but I wonder if we should only compare the "meaningful" data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It seems wrong that on one hand the memcmp is comparing up to len
and we bypass that if there is a mismatch on capacity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks guys! I'm slightly confused - here we are comparing meaningful data, no? and if capacity mismatch then it is considered not equal and we skip comparing the data content.
The equality is defined as: 1) both length should be equal, 2) both capacity should be equal, and 3) data content up to length should be equal. Let me know if this definition sounds good to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me if two arrays only differ by the amount of padding that they have then I would consider them equal. When I perform operations using these two arrays I will get the same answer (because the padding, or rather amount of padding, does not impact the result). However:
- I am focused on computation, maybe there are other implication in IPC, etc.
- In practice, this probably won't come up as we round up padding to a multiple of 64 bytes, but it could happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array equality is defined separately in array/equal.rs
and yes, it does take account on what you said (it compares buffer content with data()
). In the current context we are discussing equality of buffers though, which IMO, when looking at in isolation, should consider both len
and capacity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yes I agree.
@@ -92,12 +102,12 @@ impl Debug for BufferData { | |||
|
|||
impl Buffer { | |||
/// Creates a buffer from an existing memory region (must already be byte-aligned) | |||
pub fn from_raw_parts(ptr: *const u8, len: usize) -> Self { | |||
pub fn from_raw_parts(ptr: *const u8, len: usize, capacity: usize) -> Self { | |||
assert!( | |||
memory::is_aligned(ptr, memory::ALIGNMENT), | |||
"memory not aligned" | |||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also check that we are padded to 64 bytes. We assume this in the SIMD implementations and it's recommended here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably test for this also.
} | ||
|
||
impl PartialEq for BufferData { | ||
fn eq(&self, other: &BufferData) -> bool { | ||
if self.len != other.len { | ||
return false; | ||
} | ||
if self.capacity != other.capacity { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It seems wrong that on one hand the memcmp is comparing up to len
and we bypass that if there is a mismatch on capacity
@@ -471,6 +487,9 @@ impl PartialEq for MutableBuffer { | |||
if self.len != other.len { | |||
return false; | |||
} | |||
if self.capacity != other.capacity { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue .. memcmp compares up to len
so it doesn't matter if the capacities match or not
I initially thought this would be a blocker for IPC, but I've managed to implement the reader and writer without this. I'd vote for closing this PR @sunchao. |
I would actually like to see this PR merged (maybe after further review). Length and capacity are two distinct concepts in Arrow and having capacity be explicit is easier to understand especially for newcomers. I also think it will be useful when converting from other libraries to Arrow. In these cases, we will have to check the alignment and padding (capacity) of the data that is passed to us. |
Yes I think it is still useful to have size and capacity as two separate properties of buffer. @andygrove could you take another look on this? I think your concern has been addressed via the comments. |
@andygrove want to take look? |
@sunchao can you rebase this if you get a chance (or let me know if you are stuck for time and I will)? It turns out that this PR fixes UB as pointed out by @jturner314 on #6397. I'll re-review early next week and hopefully we can get this merged. |
Sure @paddyhoran . Updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - @andygrove you requested changes previously, do you want to re-review or can I merge this one?
Apologies for missing the earlier mentions. I really need to figure out a better way of managing all the emails I get so that these mentions stand out. |
@andygrove the easiest thing is to set up a gmail filter that sends any GitHub e-mail with "you were mentioned" directly to Inbox (vice versa, to auto-archive any GitHub e-mail that does not contain this). A separate filter rule should be used to apply a label unconditionally to Arrow-related GitHub e-mails |
No description provided.