-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up the offsets checking #1684
Conversation
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1684 +/- ##
==========================================
+ Coverage 83.12% 83.14% +0.01%
==========================================
Files 193 193
Lines 55850 56206 +356
==========================================
+ Hits 46426 46733 +307
- Misses 9424 9473 +49
Continue to review full report at Codecov.
|
rename to offsets buffer Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
arrow/src/array/data.rs
Outdated
} | ||
}) | ||
.collect::<Result<Vec<usize>>>()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collect the result here, because I want to use the windows
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of calling collect()
is that it will buffer the intermediate result (aka allocate memory and save the offsets in another buffer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has replaced with iter::scan
, no collect
now!
#[test] | ||
fn test_empty_utf8_array_with_non_zero_offset() { | ||
let data_buffer = Buffer::from_slice_ref(&"abcdef".as_bytes()); | ||
let offsets_buffer = Buffer::from_slice_ref(&[0i32, 2, 6, 0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the offsets are weird but reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I agree this is strange but ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @HaoYang670
This code has lots of test coverage so I feel good about the change functionally
I know you mentioned that one of the rationales for the change was speeding things up, but given the new collect()
and allocation I wonder if this is actually faster or if it could actually be slower. It might be worth some measurements
#[test] | ||
fn test_empty_utf8_array_with_non_zero_offset() { | ||
let data_buffer = Buffer::from_slice_ref(&"abcdef".as_bytes()); | ||
let offsets_buffer = Buffer::from_slice_ref(&[0i32, 2, 6, 0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I agree this is strange but ok
arrow/src/array/data.rs
Outdated
} | ||
}) | ||
.collect::<Result<Vec<usize>>>()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of calling collect()
is that it will buffer the intermediate result (aka allocate memory and save the offsets in another buffer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks reasonable. Agree with @alamb that a performance measurement may be good as performance is one motivation.
fix a nit in docs. Co-authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Benchmark result has been added! |
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
"Offset invariant failure: non-monotonic offset at slot {}: {} > {}", | ||
i, start_offset, end_offset)) | ||
); | ||
i - 1, start, end)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The i - 1
here is a little ugly. I try to find a more elegant way
Thanks @HaoYang670. The performance numbers look good! |
Actually, I think we could do more. Because there is some redundant checking (such as comparing each offset with |
Could we merge this PR? |
Merged, thanks @HaoYang670 @alamb |
Signed-off-by: remzi [email protected]
Which issue does this PR close?
Closes #1675.
Re #1620.
Rationale for this change
Originally, in each iteration, we fully check
start_offset
andend_offset
, which means that all elements (except the first and the last) in theoffsets_buffer
are checked twice. This is somewhat wasteful.What changes are included in this PR?
window
iteration.Are there any user-facing changes?
No.
Benchmark result