Validate arguments to ArrayData::new and null bit buffer and buffers #810
Conversation
arrow/src/array/data.rs
Outdated
    buffers.len()
);
// if min_size is zero, may not have buffers (e.g. NullArray)
assert!(min_size == 0 || buffers[0].len() >= min_size,
This is the core validation check.
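For illustration, here is a minimal self-contained sketch of that check (the function name and the use of plain `Vec<u8>` buffers are stand-ins for this example, not the arrow-rs types): unless `min_size` is zero (the NullArray case, which has no buffers at all), the first buffer must hold at least `min_size` bytes.

```rust
// Hypothetical stand-in for the buffer-size validation described above.
fn validate_buffer_sizes(buffers: &[Vec<u8>], min_size: usize) -> Result<(), String> {
    // if min_size is zero, the type may legitimately have no buffers (e.g. NullArray)
    if min_size == 0 {
        return Ok(());
    }
    match buffers.first() {
        Some(b) if b.len() >= min_size => Ok(()),
        Some(b) => Err(format!("buffer needs {} bytes, got {}", min_size, b.len())),
        None => Err("expected at least one buffer".to_string()),
    }
}
```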
@@ -869,7 +869,7 @@ mod tests {
 #[test]
 fn test_primitive_array_builder() {
     // Test building a primitive array with ArrayData builder and offset
-    let buf = Buffer::from_slice_ref(&[0, 1, 2, 3, 4]);
+    let buf = Buffer::from_slice_ref(&[0i32, 1, 2, 3, 4, 5, 6]);
These tests needed to be updated because they were creating invalid Buffers (this one, for example, had a len of 5 and an offset of 2, but only passed in an array 5 elements long, so reads would run past the end).
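The arithmetic behind the fix can be sketched with a hypothetical helper (not part of arrow): an array view needs a values buffer covering `offset + len` elements, so len 5 with offset 2 requires 7 i32 values (28 bytes), not 5 (20 bytes).

```rust
// Hypothetical helper: minimum values-buffer size in bytes for a
// primitive array with the given offset, length, and element size.
fn required_buffer_bytes(offset: usize, len: usize, elem_size: usize) -> usize {
    (offset + len) * elem_size
}
```

With `elem_size = 4` for i32, `required_buffer_bytes(2, 5, 4)` is 28 bytes, which is exactly why the test's buffer grew to 7 elements.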
@jorgecarleitao / @nevi-me / @jhorstmann do you have any concerns with this approach before I spend more time filling in the details for nested / compound structures? For some of the structures, the actual data will likely need to be inspected (to ensure, for example, that the values in the offsets buffer of utf8 arrays are valid). I am not sure about the performance impact of doing these validations -- if it turns out to be too great, I can imagine having a "trusted" version of
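The "trusted" escape hatch mentioned here would presumably follow the common Rust pattern of a checked constructor paired with an `unsafe` unchecked one. A sketch with hypothetical names (this is not the arrow-rs API):

```rust
// Hypothetical minimal container illustrating checked vs. trusted construction.
#[derive(Debug)]
struct Data {
    len: usize,
    buffer: Vec<u8>,
}

impl Data {
    /// Safe constructor: validates the invariants before building.
    fn try_new(len: usize, buffer: Vec<u8>) -> Result<Self, String> {
        if buffer.len() < len {
            return Err(format!("buffer of {} bytes cannot hold len {}", buffer.len(), len));
        }
        Ok(Self { len, buffer })
    }

    /// Trusted constructor: skips validation for performance.
    /// Safety: caller must uphold the invariants checked by `try_new`.
    unsafe fn new_unchecked(len: usize, buffer: Vec<u8>) -> Self {
        Self { len, buffer }
    }
}
```

Callers that can prove validity (e.g. internal kernels) pay no validation cost via `new_unchecked`, while external construction goes through `try_new`.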
Codecov Report

@@           Coverage Diff            @@
##           master     #810    +/-   ##
========================================
+ Coverage   82.29%   82.31%   +0.01%
========================================
  Files         168      168
  Lines       48028    48409     +381
========================================
+ Hits        39527    39847     +320
- Misses       8501     8562      +61

Continue to review the full report at Codecov.
@pitrou suggests looking at https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.h
The C++ validation looks extensive and includes offsets and dictionary keys, so basing the logic on that makes sense. These offsets and dictionary keys are actually my main performance concern; I think we want at least an
Sounds good @jhorstmann -- the C++ version includes two validation methods: one that basically checks buffer sizes, and one that checks the offsets, etc. I will attempt to mirror the same. Will ping you when I have this PR ready for review.
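The two-tier split described here can be sketched with a simplified stand-in for a utf8 array (names modeled loosely on the C++ ValidateArray / ValidateArrayFull distinction; this is not the arrow-rs API): a cheap O(1) structural check of buffer sizes, and an O(n) full check that walks the offsets.

```rust
// Hypothetical simplified utf8-like array data.
struct Utf8Data {
    len: usize,
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl Utf8Data {
    /// Cheap structural check: buffer lengths only, O(1).
    fn validate(&self) -> Result<(), String> {
        if self.offsets.len() < self.len + 1 {
            return Err("offsets buffer too small".to_string());
        }
        Ok(())
    }

    /// Full check: every offset pair monotonic and in bounds, O(n).
    fn validate_full(&self) -> Result<(), String> {
        self.validate()?;
        for w in self.offsets[..=self.len].windows(2) {
            if w[1] < w[0] || w[1] as usize > self.values.len() {
                return Err(format!("invalid offset pair {:?}", w));
            }
        }
        Ok(())
    }
}
```

The point of the split is that an array can pass the cheap check while still carrying offsets that would read past the values buffer -- only the full check catches that.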
This PR is not ready for review yet; when it is, I will take it out of draft.
Update: I have all the existing tests passing now. Still TODO:
// Test binary array with offset
let array_data = ArrayData::builder(DataType::Binary)
-    .len(4)
+    .len(2)
The array data in this case is only 4 elements long, so using an offset of 1 with len 4 is incorrect (it goes off the end of the array data). The same applies to the test below.
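The bounds rule being enforced reduces to `offset + len <= data_len`; a minimal sketch (hypothetical helper, with a checked add so pathological inputs cannot overflow):

```rust
// Hypothetical helper: is a view with this offset and len inside the data?
fn view_in_bounds(offset: usize, len: usize, data_len: usize) -> bool {
    offset
        .checked_add(len)
        .map_or(false, |end| end <= data_len)
}
```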
    .len(5)
    .build()
    .unwrap();
let data = unsafe {
build() now fails the earlier validation check -- so to keep the test exercising the internal BooleanArray checks, we need to use unsafe.
Note that it might be best to eventually remove all Array-specific checks (which I think will be redundant) in favor of consolidated checks in ArrayData, but I don't want to consider doing that until the validation checks are complete.
let list_data = ArrayData::builder(list_data_type)
-    .len(3)
+    .len(2)
This is a similar invalid test that the new checks identified -- value_data has only three items in it, so doing offset = 1 and len = 3 can potentially read off the end. This change corrects the len to be within bounds.
There is a very similar (and thus fixed) test in array_list.rs and one in array_map.rs
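The failure mode can be illustrated with a simplified offsets check (a hypothetical helper, not arrow's code): each list offset must be non-negative and non-decreasing, and the final offset must not exceed the child data length.

```rust
// Hypothetical check for list-array offsets against the child values length.
fn list_offsets_valid(offsets: &[i32], child_len: usize) -> bool {
    offsets.first().map_or(true, |&start| start >= 0)
        && offsets.windows(2).all(|w| w[0] <= w[1])
        && offsets.last().map_or(true, |&end| end as usize <= child_len)
}
```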
    .add_child_data(value_data)
    .build()
    .unwrap();
let list_data = unsafe {
This test is checking the ListArray-specific error, but since the new ArrayData validation checks now catch the problem, we need to switch to using build_unchecked() to exercise the same code path.
// https://github.com/apache/arrow-rs/issues/814
// https://github.com/apache/arrow-rs/issues/85
if matches!(&self.data_type, DataType::Union(..)) {
    return Ok(());
I finally settled on leaving UnionArray unvalidated for this PR so that I can backport it to the 6.x release line (it is backwards compatible); in a backward-incompatible PR I will fix up the UnionArray implementation (and validation).
@@ -137,7 +137,10 @@ impl UnionArray {
     }
 }

-        Ok(Self::new(type_ids, value_offsets, child_arrays, bitmap))
+        let new_self = Self::new(type_ids, value_offsets, child_arrays, bitmap);
+        new_self.data().validate()?;
As part of #85 I will clean up the union array validation
@jhorstmann, @paddyhoran, @nevi-me or @jorgecarleitao: might I trouble one of you for a review of this PR? I know it is large, but I think it is important, and it is fairly mechanical in terms of validating the creation of ArrayData.
FYI the MIRI failure is likely the same as #879 -- I'll plan to look at that shortly if no one else gets to it.
    }
}

if self.null_count > self.len {
I'm wondering what could happen if the null_count actually gives a different number than the validity buffer, and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range. I would suggest removing the null_count parameter from try_new to completely avoid inconsistencies.
Kernels that want to avoid the overhead of counting bits could use the unsafe new_unchecked method. I see it is currently set by the cast_array_data function. To make that one safe while avoiding bit counting, we could just validate that the from and to layouts are equal.
> I'm wondering what could happen if the null_count actually gives a different number than the validity buffer and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range.
I suspect (but cannot prove) that if null_count is inconsistent with the validity buffer, then there may be incorrect answers but not undefined behavior (in the Rust sense).
My plan will be:
- In a separate PR (as it will not be backwards compatible), remove the null_count parameter from try_new(): Remove null_count from ArrayData::try_new() #911
- In the PR where I add the full data checking (validate_full()), ensure that the declared null count matches the validity bitmap
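Deriving the null count from the validity bitmap, as suggested, is a straightforward bit count. A sketch (hypothetical helper; LSB-first bit order, as the Arrow format uses for validity bitmaps):

```rust
// Hypothetical helper: count the zero (null) bits among the first
// `len` bits of an LSB-first validity bitmap.
fn compute_null_count(validity: &[u8], len: usize) -> usize {
    (0..len)
        .filter(|&i| validity[i / 8] & (1 << (i % 8)) == 0)
        .count()
}
```

A constructor that always computes this value (or counts set bits byte-wise with `count_ones` for speed) cannot produce a null_count that disagrees with the bitmap.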
arrow/src/datatypes/datatype.rs
Outdated
@@ -477,6 +477,15 @@ impl DataType {
     )
 }

+/// Returns true if this type is integral: (UInt*, Int*).
+pub fn is_integer(t: &DataType) -> bool {
This seems to be only used for validating dictionary key types; maybe rename to is_dictionary_key_type and link the ArrowDictionaryKeyType trait in the comment.
Done in 8acd8d2
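A sketch of the renamed predicate over a simplified, hypothetical DataType enum (the real arrow enum has many more variants; the accepted set mirrors the integer types usable as dictionary keys):

```rust
// Hypothetical, simplified DataType enum for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
enum DataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Utf8, Float64,
}

/// Returns true if `t` is a valid dictionary key type
/// (i.e. one of the signed or unsigned integer types).
fn is_dictionary_key_type(t: &DataType) -> bool {
    matches!(
        t,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
            | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
    )
}
```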
I think this looks good. Might be worthwhile to convert some of the current
At the moment, almost all of the tests in arrow use The remaining uses of What I plan as part of implementing
Thanks @jhorstmann for the review. I think this PR is now ready to go and will plan to merge it early next week unless anyone objects or would like more time to review.
…810)
* Validate arguments to ArrayData::new: null bit buffer and buffers
* Rename is_int_type to is_dictionary_key_type()
* Correctly handle self.offset in offsets buffer
* Consolidate checks
* Fix test output
Which issue does this PR close?
Part of #817
Rationale for this change
This is a step towards improving the security posture of arrow-rs and resolving RUSTSEC vulnerabilities by validating arguments to ArrayData::try_new(). This particular PR adds basic size and offset sanity checking.
I apologize for the massive PR, but the checks and tests are fairly repetitive.
See discussion on #817 for more details
What changes are included in this PR?
Validation checks in ArrayData::try_new(), based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrou

Planned for follow on PRs:
- validate_full: will also check the values of any offsets buffers: Add full data validation for ArrayData::try_new() #921
- UnionArray implementation / validate it properly (Implement Union validity bitmap changes from ARROW-9222 #85)
ArrayData::try_new() may now return an Err on some (invalid) inputs rather than succeeding