-
Notifications
You must be signed in to change notification settings - Fork 821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add full data validation for ArrayData::try_new() #921
Add full data validation for ArrayData::try_new() #921
Conversation
fda948d
to
f968803
Compare
Codecov Report
@@ Coverage Diff @@
## master #921 +/- ##
==========================================
+ Coverage 82.29% 82.31% +0.01%
==========================================
Files 168 168
Lines 48761 49031 +270
==========================================
+ Hits 40129 40358 +229
- Misses 8632 8673 +41
Continue to review full report at Codecov.
|
30710a1
to
74b74dd
Compare
Update: I have a bunch of tests written, so that is good 🎉 I think there is some more testing needed here as when I tried to apply |
c416ad1
to
9d97916
Compare
6180179
to
6e6e410
Compare
This PR (that has full ArrayData validate) is now ready for review . I hope it allows us to close the currently open RUSTSEC entries for arrow-rs: https://github.com/rustsec/advisory-db/tree/main/crates/arrow cc @jhorstmann @bjchambers @jimexist @jorgecarleitao I know it is a large ask to review this PR, so thank you if you have time to do so. As for testing, if you have a project that uses |
Co-authored-by: Jörn Horstmann <[email protected]>
…ta_construction_validation
Thanks again everyone for the help. I'll backport this PR and hopefully we can get it shipped in 6.4.0 |
* Add full data validation for ArrayData::try_new() * Only look at offset+len indexes Co-authored-by: Jörn Horstmann <[email protected]> * fix test * fmt * test for array indexes Co-authored-by: Jörn Horstmann <[email protected]>
* Add full data validation for ArrayData::try_new() * Only look at offset+len indexes Co-authored-by: Jörn Horstmann <[email protected]> * fix test * fmt * test for array indexes Co-authored-by: Jörn Horstmann <[email protected]>
* Add full data validation for ArrayData::try_new() (#921) * Add full data validation for ArrayData::try_new() * Only look at offset+len indexes Co-authored-by: Jörn Horstmann <[email protected]> * fix test * fmt * test for array indexes Co-authored-by: Jörn Horstmann <[email protected]> * Fix: clippy Co-authored-by: Jörn Horstmann <[email protected]>
…ike arrays # Rationale The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: apache/arrow-rs#1071 and the existing [docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) seem to be somewhat vague on this issue. I looked at three implementations of arrow, and they all seem to assume / validate the offsets are monotonic: * C++ implementation (I think) also also ensures the offsets are monotonic without first checking the validity array https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592 * arrow-rs after apache/arrow-rs#921 (based on the C++) will refuse to create arrays where the array offsets are non monotonic * arrow2 also ensures that offsets are always monotonic. https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119 # Changes Thus I propose updating the format docs to make the monotonic offsets explicit. # Background I think @jorgecarleitao's description on apache/arrow-rs#1071 (comment), explains the reason why having monotonic offsets is a good idea > I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array is always initialized, and the offsets being valid and in-bounds even in null cases. > > The rational for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and & the bitmaps instead. Closes #12019 from alamb/alamb/clarify_offsets Lead-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Which issue does this PR close?
Closes #817
Rationale for this change
Improve the security arrow-rs by validating untrusted user input, and resolves several outstanding potential security issues by doing deep checks of the data passed to
ArrayData::try_new()
What changes are included in this PR?
ArrayData::try_new()
based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrouAre there any user-facing changes?
ArrayData::try_new()
may return anErr
now on some (invalid) inputs rather than succeeding and will take more time.pub ArrayData::validate_full()
that performs fullArrayData
validationTesting
arrow-rs
tests pass (other than some different error messages) when I temporarily changed the code to callvalidate_full()
always as in (DEMO) tests pass whenvalidate_full
is forced #987 (review)datafusion
tests pass with these changes (using https://github.com/alamb/arrow-rs/tree/alamb/full_array_data_construction_validation_df which has the changes in this PR cherry-picked)Follow On:
Union
in Update Union Array to addUnionMode
, match latest Arrow Spec, and renamenew
->unsafe new_unchecked()
#885 (not backwards compatible)