Speed up the offsets checking #1684

Merged
merged 9 commits into from
May 15, 2022

Conversation

@HaoYang670 HaoYang670 commented May 10, 2022

Signed-off-by: remzi [email protected]

Which issue does this PR close?

Closes #1675.
Re #1620.

Rationale for this change

Originally, each iteration fully validated both start_offset and end_offset, so every element of the offsets_buffer except the first and the last was checked twice. This is wasteful.
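The duplicated work can be illustrated with a small, self-contained sketch. This is not the arrow-rs code: `check_one`, `validate_per_window`, `validate_hoisted`, and the `checks` counter are hypothetical names introduced only to count how often an offset is validated in each shape.

```rust
/// Hypothetical per-offset check: rejects negative and out-of-bounds offsets
/// and counts how many times it has been called.
fn check_one(offset: i32, limit: usize, checks: &mut usize) -> Result<usize, String> {
    *checks += 1;
    if offset < 0 {
        return Err(format!("negative offset {}", offset));
    }
    let offset = offset as usize;
    if offset > limit {
        return Err(format!("offset {} out of bounds {}", offset, limit));
    }
    Ok(offset)
}

/// Original shape: every window re-validates both of its ends.
/// Returns the number of per-offset checks performed.
fn validate_per_window(offsets: &[i32], limit: usize) -> Result<usize, String> {
    let mut checks = 0;
    for (i, w) in offsets.windows(2).enumerate() {
        let start = check_one(w[0], limit, &mut checks)?;
        let end = check_one(w[1], limit, &mut checks)?;
        if start > end {
            return Err(format!("non-monotonic offset at slot {}: {} > {}", i, start, end));
        }
    }
    Ok(checks)
}

/// Hoisted shape: the first offset is validated once before the loop,
/// and each iteration validates only the new end offset.
fn validate_hoisted(offsets: &[i32], limit: usize) -> Result<usize, String> {
    let mut checks = 0;
    let (first, rest) = match offsets.split_first() {
        Some((first, rest)) => (*first, rest),
        None => return Ok(0),
    };
    let mut prev = check_one(first, limit, &mut checks)?;
    for (i, &raw) in rest.iter().enumerate() {
        let end = check_one(raw, limit, &mut checks)?;
        if prev > end {
            return Err(format!("non-monotonic offset at slot {}: {} > {}", i, prev, end));
        }
        prev = end;
    }
    Ok(checks)
}
```

For five offsets, the per-window shape performs eight per-offset checks and the hoisted shape performs five, one per element.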

What changes are included in this PR?

  1. Hoist some checking outside the window iteration.
  2. Update some error messages.
  3. Add docs and tests for empty binary-like arrays. (Arrow tends to be ambiguous in this case, so I keep the current implementation. We can change it in the future, or in #1620, "Ensure there is a single zero in the offsets buffer for an empty ListArray".)
  4. Add a benchmark for offsets checking.

Are there any user-facing changes?

No.

Benchmark result

Gnuplot not found, using plotters backend
validate_binary_array_data 20000                                                                             
                        time:   [35.595 us 35.650 us 35.712 us]
                        change: [-17.373% -17.034% -16.724%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 10, 2022

codecov-commenter commented May 10, 2022

Codecov Report

Merging #1684 (8f07d64) into master (e0a527b) will increase coverage by 0.01%.
The diff coverage is 93.44%.

@@            Coverage Diff             @@
##           master    #1684      +/-   ##
==========================================
+ Coverage   83.12%   83.14%   +0.01%     
==========================================
  Files         193      193              
  Lines       55850    56206     +356     
==========================================
+ Hits        46426    46733     +307     
- Misses       9424     9473      +49     
| Impacted Files | Coverage Δ |
|---|---|
| arrow/src/array/data.rs | 84.35% <93.44%> (+1.29%) ⬆️ |
| parquet/src/arrow/array_reader/test_util.rs | 81.08% <0.00%> (-2.26%) ⬇️ |
| arrow/src/datatypes/datatype.rs | 65.45% <0.00%> (-1.35%) ⬇️ |
| arrow/src/compute/kernels/length.rs | 99.04% <0.00%> (-0.96%) ⬇️ |
| arrow/src/compute/kernels/substring.rs | 99.36% <0.00%> (-0.64%) ⬇️ |
| parquet_derive/src/parquet_field.rs | 65.98% <0.00%> (-0.23%) ⬇️ |
| arrow/src/ipc/reader.rs | 88.80% <0.00%> (-0.23%) ⬇️ |
| arrow/src/compute/kernels/take.rs | 95.27% <0.00%> (-0.14%) ⬇️ |
| parquet/src/arrow/array_reader.rs | 89.76% <0.00%> (-0.07%) ⬇️ |
| arrow-flight/src/utils.rs | 0.00% <0.00%> (ø) |
| ... and 13 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0a527b...8f07d64. Read the comment docs.

}
})
.collect::<Result<Vec<usize>>>()?;
Contributor Author

@HaoYang670 HaoYang670 May 10, 2022

I collect the result here because I want to use the windows method.

Contributor

The downside of calling collect() is that it will buffer the intermediate result (aka allocate memory and save the offsets in another buffer)

Contributor Author

Replaced with iter::scan, so there is no collect now!
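The trade-off discussed in this thread can be sketched roughly as follows. `slice::windows` only works on a slice, so converting the raw `i32` offsets to `usize` first forces an intermediate `Vec`, while `Iterator::scan` carries the previous offset as state and needs no buffer. The function names here are hypothetical and the code is a simplified illustration, not the code merged in this PR.

```rust
/// Convert a raw i32 offset to usize, rejecting negative values.
fn to_usize(raw: i32) -> Result<usize, String> {
    if raw < 0 {
        Err(format!("negative offset {}", raw))
    } else {
        Ok(raw as usize)
    }
}

/// Variant A: collect the converted offsets into a Vec so that
/// `windows(2)` can be used. This allocates an intermediate buffer.
fn validate_with_collect(offsets: &[i32]) -> Result<(), String> {
    let converted = offsets
        .iter()
        .map(|&raw| to_usize(raw))
        .collect::<Result<Vec<usize>, String>>()?;
    for (i, w) in converted.windows(2).enumerate() {
        if w[0] > w[1] {
            return Err(format!("non-monotonic offset at slot {}: {} > {}", i, w[0], w[1]));
        }
    }
    Ok(())
}

/// Variant B: `scan` keeps the previous offset as its state, so each
/// (slot, start, end) triple is produced lazily without any allocation.
fn validate_with_scan(offsets: &[i32]) -> Result<(), String> {
    let mut converted = offsets.iter().map(|&raw| to_usize(raw));
    let first = match converted.next() {
        Some(first) => first?,
        None => return Ok(()),
    };
    converted
        .enumerate()
        .scan(first, |prev, (i, current)| {
            Some(current.map(|end| {
                let start = *prev;
                *prev = end; // carry the current offset into the next step
                (i, start, end)
            }))
        })
        .try_for_each(|triple| {
            let (i, start, end) = triple?;
            if start > end {
                Err(format!("non-monotonic offset at slot {}: {} > {}", i, start, end))
            } else {
                Ok(())
            }
        })
}
```

One behavioural difference in this sketch: variant A converts every offset before comparing any pair, so a negative offset late in the buffer is reported ahead of an earlier ordering error, whereas variant B reports whichever error comes first in buffer order.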

#[test]
fn test_empty_utf8_array_with_non_zero_offset() {
let data_buffer = Buffer::from_slice_ref(&"abcdef".as_bytes());
let offsets_buffer = Buffer::from_slice_ref(&[0i32, 2, 6, 0]);
Contributor Author

I think the offsets are weird but reasonable.

Contributor

yeah, I agree this is strange but ok
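One plausible reading of why these offsets are acceptable, sketched under assumptions: the excerpt does not show the remaining arguments of the test, so the sketch assumes a length-0 array with an array offset of 3, and assumes that an array of length `len` reads `len + 1` entries of the offsets buffer starting at its array offset. `used_offsets` and `is_valid` are hypothetical helpers, not arrow-rs functions.

```rust
/// Hypothetical helper: the part of the offsets buffer that an array with
/// the given array `offset` and `len` reads (`len + 1` entries).
fn used_offsets(offsets: &[i32], offset: usize, len: usize) -> Option<&[i32]> {
    offsets.get(offset..offset + len + 1)
}

/// Hypothetical validity check restricted to the entries actually read:
/// each must lie within the data buffer, and they must be non-decreasing.
fn is_valid(offsets: &[i32], offset: usize, len: usize, data_len: usize) -> bool {
    match used_offsets(offsets, offset, len) {
        None => false,
        Some(used) => {
            let in_bounds = used.iter().all(|&o| o >= 0 && (o as usize) <= data_len);
            let monotonic = used.windows(2).all(|w| w[0] <= w[1]);
            in_bounds && monotonic
        }
    }
}
```

Under those assumptions, a length-0 array at array offset 3 reads only the final `0`, so the drop from 6 to 0 is never compared. Reading the same buffer as a length-3 array from array offset 0 would fail the monotonicity check.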

Contributor

@alamb alamb left a comment

Thanks @HaoYang670

This code has lots of test coverage so I feel good about the change functionally

I know you mentioned that one of the rationales for the change was speeding things up, but given the new collect() and allocation I wonder if this is actually faster or if it could actually be slower. It might be worth some measurements

Member

@viirya viirya left a comment

The change looks reasonable. Agree with @alamb that a performance measurement may be good as performance is one motivation.

@HaoYang670
Contributor Author

Thank you for your review @alamb @viirya. I will add a benchmark for offsets checking. Also, I find that using iter::scan is better than using slice::windows, because it avoids collecting.

HaoYang670 and others added 3 commits May 11, 2022 10:52
fix a nit in docs.

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: remzi <[email protected]>
@HaoYang670
Contributor Author

Benchmark result has been added!

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
"Offset invariant failure: non-monotonic offset at slot {}: {} > {}",
i, start_offset, end_offset))
);
i - 1, start, end))
Contributor Author

The i - 1 here is a little ugly. I will try to find a more elegant way.

@viirya
Member

viirya commented May 11, 2022

Thanks @HaoYang670. The performance numbers look good!

@HaoYang670
Contributor Author

Actually, I think we could do more, because some redundant checking remains (such as comparing each offset with offset_limit). However, since a number of tests rely on that checking, I have not removed it.

@HaoYang670
Contributor Author

Could we merge this PR?

@HaoYang670 HaoYang670 changed the title simplify offsets checking Speed up the offsets checking May 15, 2022
@viirya viirya merged commit ff182f1 into apache:master May 15, 2022
@viirya
Member

viirya commented May 15, 2022

Merged, thanks @HaoYang670 @alamb

@HaoYang670 HaoYang670 deleted the simplity_offsets_checking branch June 1, 2022 01:05