ARROW-5357: [Rust] Change Buffer::len to represent total bytes instead of used bytes #4331

sunchao · 2019-05-16T23:32:44Z

No description provided.

sunchao · 2019-05-17T05:29:33Z

After discussing with @paddyhoran on the JIRA, I'm now inclined to keep both len and capacity in Buffer, which lets us preserve most of the existing functionality. To fix the flatbuffer conversion issue, we'll need ARROW-5358, which allows us to compare ArrayData properly.

@nevi-me let me know if this sounds good to you.

nevi-me · 2019-05-17T05:33:45Z

I saw the conversation. Would we be able to use capacity when writing to IPC from the current Buffer?

sunchao · 2019-05-17T05:35:12Z

Yes, both reading from and writing to the IPC buffer will use capacity instead of len.

nevi-me · 2019-05-22T05:16:59Z

Apologies for not responding sooner @sunchao

Only MutableBuffer has a capacity; when we freeze it to create a Buffer, we 'lose' the capacity.

#[derive(PartialEq, Debug)]
pub struct Buffer {
    /// Reference-counted pointer to the internal byte buffer.
    data: Arc<BufferData>,

    /// The offset into the buffer.
    offset: usize,
}

struct BufferData {
    ptr: *const u8,
    len: usize,
    // no capacity
}
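
For illustration, here is a minimal sketch of the alternative discussed in this thread, in which the frozen buffer keeps its capacity next to its length. The `ToyMutableBuffer`, `ToyBuffer`, and `ToyBufferData` names are hypothetical, a `Vec<u8>` stands in for the raw pointer, and the 64-byte rounding is an assumption taken from the padding mentioned later in this conversation. It is not the arrow crate's actual implementation.

```rust
use std::sync::Arc;

/// Assumed padding granularity (hypothetical constant for this sketch).
const PADDING: usize = 64;

/// Hypothetical stand-in for `BufferData`, with capacity kept explicitly.
struct ToyBufferData {
    bytes: Vec<u8>,  // the whole allocation, including padding
    len: usize,      // number of bytes actually in use
    capacity: usize, // total number of allocated bytes
}

/// Hypothetical stand-in for the immutable `Buffer`.
struct ToyBuffer {
    data: Arc<ToyBufferData>,
    offset: usize,
}

impl ToyBuffer {
    fn len(&self) -> usize {
        self.data.len - self.offset
    }

    fn capacity(&self) -> usize {
        self.data.capacity - self.offset
    }

    /// Only the bytes in use, without the padding.
    fn data(&self) -> &[u8] {
        &self.data.bytes[self.offset..self.data.len]
    }
}

/// Hypothetical stand-in for `MutableBuffer`; it never grows in this sketch.
struct ToyMutableBuffer {
    bytes: Vec<u8>,
    len: usize,
}

impl ToyMutableBuffer {
    fn new(capacity: usize) -> Self {
        // Round the requested capacity up to a multiple of PADDING.
        let capacity = (capacity + PADDING - 1) / PADDING * PADDING;
        ToyMutableBuffer { bytes: vec![0; capacity], len: 0 }
    }

    fn write(&mut self, src: &[u8]) {
        let end = self.len + src.len();
        assert!(end <= self.bytes.len(), "toy buffer does not grow");
        self.bytes[self.len..end].copy_from_slice(src);
        self.len = end;
    }

    /// Freezing carries the capacity over instead of dropping it.
    fn freeze(self) -> ToyBuffer {
        let capacity = self.bytes.len();
        let data = ToyBufferData { bytes: self.bytes, len: self.len, capacity };
        ToyBuffer { data: Arc::new(data), offset: 0 }
    }
}
```

With this shape, writing three bytes into a buffer created with `ToyMutableBuffer::new(10)` and freezing it yields a buffer whose `len()` is 3 and whose `capacity()` is 64.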

sunchao · 2019-05-22T05:21:46Z

@nevi-me yes that's right. As I mentioned above, I think it's better to keep both len and capacity in Buffer (same as MutableBuffer).

I'm currently working on https://issues.apache.org/jira/browse/ARROW-5358 which hopefully will allow us to compare Array and ArrayData. It should be able to help resolve the issue you are seeing. I plan to come back to this after that is done.

nevi-me · 2019-05-22T05:35:16Z

Okay, great! I'll keep #4167 open, and continue #4330 without ArrayData comparisons for now

codecov-io · 2019-07-28T23:34:07Z

Codecov Report

Merging #4331 into master will decrease coverage by 4.92%.
The diff coverage is 88.88%.

@@             Coverage Diff             @@
##           master    #4331       +/-   ##
===========================================
- Coverage    87.5%   82.57%    -4.93%     
===========================================
  Files         998       86      -912     
  Lines      141784    24935   -116849     
  Branches     1418        0     -1418     
===========================================
- Hits       124065    20591   -103474     
+ Misses      17357     4344    -13013     
+ Partials      362        0      -362

Impacted Files	Coverage Δ
rust/arrow/src/memory.rs	`100% <100%> (ø)`	⬆️
rust/arrow/src/array/array.rs	`92.57% <100%> (ø)`	⬆️
rust/arrow/src/tensor.rs	`92.3% <80%> (-0.48%)`	⬇️
rust/arrow/src/buffer.rs	`91.3% <88.88%> (+1.52%)`	⬆️
python/pyarrow/ipc.pxi
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
r/src/symbols.cpp
... and 907 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 38b0176...7eac173.

paddyhoran · 2019-09-09T17:13:13Z

@nevi-me @sunchao I know there was some related discussion here, but are we OK to move forward with this PR (i.e. adding capacity to Buffer)? Is this ready for review?

sunchao · 2019-09-09T23:08:59Z

I think it is ready for review. It'd be great if you can take a look.

paddyhoran · 2019-09-09T23:27:30Z

Yep, sounds good. I'll take a look when I get a chance.

paddyhoran

LGTM, just a few nits.

paddyhoran · 2019-09-10T01:50:51Z

rust/arrow/src/buffer.rs

 }

 impl PartialEq for BufferData {
    fn eq(&self, other: &BufferData) -> bool {
        if self.len != other.len {
            return false;
        }
+        if self.capacity != other.capacity {


I'm not against the current implementation, but I wonder if we should only compare the "meaningful" data?

andygrove · 2019-09-10

I agree. It seems wrong that, on the one hand, the memcmp compares up to len, yet we bypass that comparison if there is a mismatch on capacity.

sunchao · 2019-09-10

Thanks guys! I'm slightly confused: here we are comparing meaningful data, no? If the capacities mismatch, the buffers are considered not equal and we skip comparing the data content.

The equality is defined as: 1) both lengths should be equal, 2) both capacities should be equal, and 3) the data content up to length should be equal. Let me know if this definition sounds good to you.

paddyhoran · 2019-09-10

To me, if two arrays differ only by the amount of padding they have, then I would consider them equal. When I perform operations using these two arrays I will get the same answer (because the padding, or rather the amount of padding, does not impact the result). However:

I am focused on computation; maybe there are other implications in IPC, etc.

In practice, this probably won't come up as we round up padding to a multiple of 64 bytes, but it could happen.

sunchao · 2019-09-10

Array equality is defined separately in array/equal.rs and yes, it does take into account what you said (it compares buffer content with data()). In the current context we are discussing equality of buffers though, which IMO, when looked at in isolation, should consider both len and capacity.

paddyhoran · 2019-09-11

Ok, yes I agree.
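
The two notions of equality debated in this thread can be sketched side by side. This is a simplified illustration under stated assumptions: `ToyBufferData` and `content_eq` are hypothetical names, a `Vec<u8>` replaces the raw pointer, and a slice comparison replaces the memcmp call. It is not the code from the PR.

```rust
/// Hypothetical, simplified stand-in for `BufferData`.
struct ToyBufferData {
    bytes: Vec<u8>,  // the whole allocation, including padding
    len: usize,      // number of meaningful bytes
    capacity: usize, // total number of allocated bytes
}

impl ToyBufferData {
    /// Copies `used` into a fresh allocation of `capacity` bytes and fills
    /// the remaining padding with `pad`.
    fn new(used: &[u8], capacity: usize, pad: u8) -> Self {
        assert!(used.len() <= capacity, "len must not exceed capacity");
        let mut bytes = vec![pad; capacity];
        bytes[..used.len()].copy_from_slice(used);
        ToyBufferData { bytes, len: used.len(), capacity }
    }
}

/// Buffer-level equality as defined in the thread:
/// 1) equal lengths, 2) equal capacities, 3) equal content up to `len`.
impl PartialEq for ToyBufferData {
    fn eq(&self, other: &Self) -> bool {
        if self.len != other.len {
            return false;
        }
        if self.capacity != other.capacity {
            return false;
        }
        // The padding bytes are never inspected.
        self.bytes[..self.len] == other.bytes[..other.len]
    }
}

/// The alternative raised in review: compare only the meaningful bytes
/// and ignore how much padding each buffer has.
fn content_eq(a: &ToyBufferData, b: &ToyBufferData) -> bool {
    a.len == b.len && a.bytes[..a.len] == b.bytes[..b.len]
}
```

Under this sketch, two buffers holding the same three bytes with capacities 64 and 128 compare as unequal with `==` but as equal with `content_eq`, which is the distinction the reviewers were weighing. Differences in the padding bytes themselves affect neither comparison.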

paddyhoran · 2019-09-10T01:55:58Z

rust/arrow/src/buffer.rs

@@ -92,12 +102,12 @@ impl Debug for BufferData {

 impl Buffer {
    /// Creates a buffer from an existing memory region (must already be byte-aligned)
-    pub fn from_raw_parts(ptr: *const u8, len: usize) -> Self {
+    pub fn from_raw_parts(ptr: *const u8, len: usize, capacity: usize) -> Self {
        assert!(
            memory::is_aligned(ptr, memory::ALIGNMENT),
            "memory not aligned"
        );


We should also check that we are padded to 64 bytes. We assume this in the SIMD implementations and it's recommended here.

We should probably test for this also.
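
A minimal sketch of such a check, assuming a 64-byte alignment and padding granularity. The `is_aligned` function and `ALIGNMENT` constant below are local stand-ins for the `memory::is_aligned` and `memory::ALIGNMENT` items referenced in the diff, and `is_padded`, `round_up_to_multiple_of_64`, and `check_raw_parts` are hypothetical helpers introduced only for illustration.

```rust
/// Assumed alignment and padding granularity, in bytes.
const ALIGNMENT: usize = 64;

/// Local stand-in for `memory::is_aligned`: true when the address is a
/// multiple of `alignment`. The pointer is never dereferenced.
fn is_aligned(ptr: *const u8, alignment: usize) -> bool {
    (ptr as usize) % alignment == 0
}

/// True when `capacity` is a whole number of 64-byte blocks.
fn is_padded(capacity: usize) -> bool {
    capacity % ALIGNMENT == 0
}

/// Rounds `n` up to the next multiple of 64 (64 is a power of two, so a
/// bit mask is enough).
fn round_up_to_multiple_of_64(n: usize) -> usize {
    (n + 63) & !63
}

/// Hypothetical validation that a `from_raw_parts`-style constructor could
/// run before accepting a memory region.
fn check_raw_parts(ptr: *const u8, len: usize, capacity: usize) -> Result<(), String> {
    if !is_aligned(ptr, ALIGNMENT) {
        return Err("memory not aligned".to_string());
    }
    if len > capacity {
        return Err("len exceeds capacity".to_string());
    }
    if !is_padded(capacity) {
        return Err("capacity not padded to a multiple of 64 bytes".to_string());
    }
    Ok(())
}
```

Returning a `Result` rather than asserting is a choice made for this sketch so that the failure cases are easy to test; the constructor in the diff uses `assert!` instead.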

andygrove · 2019-09-10T13:51:20Z

rust/arrow/src/buffer.rs

 }

 impl PartialEq for BufferData {
    fn eq(&self, other: &BufferData) -> bool {
        if self.len != other.len {
            return false;
        }
+        if self.capacity != other.capacity {


I agree. It seems wrong that, on the one hand, the memcmp compares up to len, yet we bypass that comparison if there is a mismatch on capacity.

andygrove · 2019-09-10T13:53:00Z

rust/arrow/src/buffer.rs

@@ -471,6 +487,9 @@ impl PartialEq for MutableBuffer {
        if self.len != other.len {
            return false;
        }
+        if self.capacity != other.capacity {


Same issue: memcmp compares up to len, so it doesn't matter whether the capacities match.

nevi-me · 2019-12-29T22:13:25Z

I initially thought this would be a blocker for IPC, but I've managed to implement the reader and writer without this. I'd vote for closing this PR @sunchao.

paddyhoran · 2019-12-30T18:55:37Z

> I'd vote for closing this PR @sunchao.

I would actually like to see this PR merged (maybe after further review). Length and capacity are two distinct concepts in Arrow, and having capacity be explicit is easier to understand, especially for newcomers.

I also think it will be useful when converting from other libraries to Arrow. In these cases, we will have to check the alignment and padding (capacity) of the data that is passed to us.

sunchao · 2019-12-31T17:16:08Z

Yes, I think it is still useful to have size and capacity as two separate properties of Buffer. @andygrove could you take another look at this? I think your concern has been addressed via the comments.

emkornfield · 2020-02-03T03:26:44Z

@andygrove want to take a look?

paddyhoran · 2020-02-14T21:23:21Z

@sunchao can you rebase this if you get a chance (or let me know if you are stuck for time and I will)? It turns out that this PR fixes UB as pointed out by @jturner314 on #6397.

I'll re-review early next week and hopefully we can get this merged.

sunchao · 2020-02-15T17:59:05Z

Sure @paddyhoran . Updated.

paddyhoran

LGTM - @andygrove you requested changes previously, do you want to re-review or can I merge this one?

andygrove · 2020-02-20T05:44:07Z

Apologies for missing the earlier mentions. I really need to figure out a better way of managing all the emails I get so that these mentions stand out.

wesm · 2020-02-20T11:32:37Z

@andygrove the easiest thing is to set up a Gmail filter that sends any GitHub e-mail with "you were mentioned" directly to the Inbox (and, conversely, auto-archives any GitHub e-mail that does not contain this). A separate filter rule should be used to apply a label unconditionally to Arrow-related GitHub e-mails.

sunchao force-pushed the ARROW-5357 branch from 2b22921 to bba3d1e Compare May 16, 2019 23:34

sunchao changed the title ~~Arrow 5357~~ ARROW-5357: [Rust] Change Buffer::len to represent total bytes instead of used bytes May 16, 2019

sunchao added the Component: Rust label May 16, 2019

nevi-me approved these changes May 17, 2019

wesm mentioned this pull request Jun 12, 2019

ARROW-5351: [Rust] Take kernel #4330

Closed

emkornfield force-pushed the ARROW-5357 branch from bba3d1e to dee57b7 Compare July 5, 2019 09:07

kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29

sunchao force-pushed the ARROW-5357 branch from dee57b7 to 7eac173 Compare July 28, 2019 22:13

paddyhoran reviewed Sep 10, 2019

andygrove requested changes Sep 10, 2019

kszucs force-pushed the ARROW-5357 branch from 7eac173 to 9c7e657 Compare October 5, 2019 10:17

paddyhoran mentioned this pull request Feb 12, 2020

ARROW-7624: [Rust] Soundness issues via Buffer methods #6397

Closed

sunchao added 2 commits February 15, 2020 09:52

ARROW-5357: [Rust] Change Buffer::len to represent total bytes instead of used bytes

d9be1df

Add capacity in Buffer struct

1ab4ccc

sunchao force-pushed the ARROW-5357 branch from 9c7e657 to 1ab4ccc Compare February 15, 2020 17:58

sunchao added 2 commits February 15, 2020 15:51

Fix rebase error

bbb30cb

Fix Bitmap equality

2169ec5

paddyhoran approved these changes Feb 20, 2020

andygrove approved these changes Feb 20, 2020

paddyhoran closed this in 28ec94c Feb 21, 2020

paddyhoran mentioned this pull request Nov 4, 2020

ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590

Closed

asfimport mentioned this pull request Feb 21, 2020

[Rust] Add capacity field in Buffer #21816

Closed
