Validate arguments to ArrayData::new and null bit buffer and buffers #810
Conversation
arrow/src/array/data.rs
Outdated
    buffers.len()
);
// if min_size is zero, may not have buffers (e.g. NullArray)
assert!(min_size == 0 || buffers[0].len() >= min_size,
This is the core validation check.
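For illustration, here is a minimal self-contained sketch of that check (the function name and the use of plain `Vec<u8>` buffers are stand-ins for this example, not the arrow-rs types): unless `min_size` is zero (the NullArray case, which has no buffers at all), the first buffer must hold at least `min_size` bytes.

```rust
// Hypothetical stand-in for the buffer-size validation described above.
fn validate_buffer_sizes(buffers: &[Vec<u8>], min_size: usize) -> Result<(), String> {
    // if min_size is zero, the type may legitimately have no buffers (e.g. NullArray)
    if min_size == 0 {
        return Ok(());
    }
    match buffers.first() {
        Some(b) if b.len() >= min_size => Ok(()),
        Some(b) => Err(format!("buffer needs {} bytes, got {}", min_size, b.len())),
        None => Err("expected at least one buffer".to_string()),
    }
}
```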
@@ -869,7 +869,7 @@ mod tests {
 #[test]
 fn test_primitive_array_builder() {
     // Test building a primitive array with ArrayData builder and offset
-    let buf = Buffer::from_slice_ref(&[0, 1, 2, 3, 4]);
+    let buf = Buffer::from_slice_ref(&[0i32, 1, 2, 3, 4, 5, 6]);
These tests needed to be updated because they were creating invalid Buffers (this one, for example, had a len of 5 and an offset of 2, but only passed in an array 5 elements long, so reads would run past the end).
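The arithmetic behind the fix can be sketched with a hypothetical helper (not part of arrow): an array view needs a values buffer covering `offset + len` elements, so len 5 with offset 2 requires 7 i32 values (28 bytes), not 5 (20 bytes).

```rust
// Hypothetical helper: minimum values-buffer size in bytes for a
// primitive array with the given offset, length, and element size.
fn required_buffer_bytes(offset: usize, len: usize, elem_size: usize) -> usize {
    (offset + len) * elem_size
}
```

With `elem_size = 4` for i32, `required_buffer_bytes(2, 5, 4)` is 28 bytes, which is exactly why the test's buffer grew to 7 elements.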
@jorgecarleitao / @nevi-me / @jhorstmann do you have any concerns with this approach before I spend more time filling in the details for nested / compound structures? For some of the structures, the actual data will likely need to be inspected (to ensure, for example, that the values in the offsets buffer of utf8 arrays are valid). I am not sure about the performance impact of doing these validations -- if it turns out to be too great, I can imagine having a "trusted" version of
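The "trusted" escape hatch mentioned here would presumably follow the common Rust pattern of a checked constructor paired with an `unsafe` unchecked one. A sketch with hypothetical names (this is not the arrow-rs API):

```rust
// Hypothetical minimal container illustrating checked vs. trusted construction.
#[derive(Debug)]
struct Data {
    len: usize,
    buffer: Vec<u8>,
}

impl Data {
    /// Safe constructor: validates the invariants before building.
    fn try_new(len: usize, buffer: Vec<u8>) -> Result<Self, String> {
        if buffer.len() < len {
            return Err(format!("buffer of {} bytes cannot hold len {}", buffer.len(), len));
        }
        Ok(Self { len, buffer })
    }

    /// Trusted constructor: skips validation for performance.
    /// Safety: caller must uphold the invariants checked by `try_new`.
    unsafe fn new_unchecked(len: usize, buffer: Vec<u8>) -> Self {
        Self { len, buffer }
    }
}
```

Callers that can prove validity (e.g. internal kernels) pay no validation cost via `new_unchecked`, while external construction goes through `try_new`.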
Codecov Report

@@           Coverage Diff            @@
##           master     #810    +/-   ##
========================================
+ Coverage   82.29%   82.31%   +0.01%
========================================
  Files         168      168
  Lines       48028    48409     +381
========================================
+ Hits        39527    39847     +320
- Misses       8501     8562      +61

Continue to review the full report at Codecov.
@pitrou suggests looking at https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.h
The C++ validation looks extensive and includes offsets and dictionary keys, so basing the logic on that makes sense. These offsets and dictionary keys are actually my main performance concern; I think we want at least an
Sounds good @jhorstmann -- the C++ version includes two validation methods: one that basically checks buffer sizes, and one that checks the offsets, etc. I will attempt to mirror the same. Will ping you when I have this PR ready for review.
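The two-tier split described here can be sketched with a simplified stand-in for a utf8 array (names modeled loosely on the C++ ValidateArray / ValidateArrayFull distinction; this is not the arrow-rs API): a cheap O(1) structural check of buffer sizes, and an O(n) full check that walks the offsets.

```rust
// Hypothetical simplified utf8-like array data.
struct Utf8Data {
    len: usize,
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl Utf8Data {
    /// Cheap structural check: buffer lengths only, O(1).
    fn validate(&self) -> Result<(), String> {
        if self.offsets.len() < self.len + 1 {
            return Err("offsets buffer too small".to_string());
        }
        Ok(())
    }

    /// Full check: every offset pair monotonic and in bounds, O(n).
    fn validate_full(&self) -> Result<(), String> {
        self.validate()?;
        for w in self.offsets[..=self.len].windows(2) {
            if w[1] < w[0] || w[1] as usize > self.values.len() {
                return Err(format!("invalid offset pair {:?}", w));
            }
        }
        Ok(())
    }
}
```

The point of the split is that an array can pass the cheap check while still carrying offsets that would read past the values buffer -- only the full check catches that.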
This PR is not ready for review yet; when it is, I will take it out of draft.
Update: I have all the existing tests passing now. Still TODO:
// Test binary array with offset
let array_data = ArrayData::builder(DataType::Binary)
-    .len(4)
+    .len(2)
The array data in this case is only 4 elements long, so using an offset of 1 with len 4 is incorrect (it goes off the end of the array data). The same applies to the test below.
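The bounds rule being enforced reduces to `offset + len <= data_len`; a minimal sketch (hypothetical helper, with a checked add so pathological inputs cannot overflow):

```rust
// Hypothetical helper: is a view with this offset and len inside the data?
fn view_in_bounds(offset: usize, len: usize, data_len: usize) -> bool {
    offset
        .checked_add(len)
        .map_or(false, |end| end <= data_len)
}
```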
    .len(5)
    .build()
    .unwrap();
let data = unsafe {
build() now fails the earlier validation check -- so to keep the test exercising the internal BooleanArray checks, we need to use unsafe.
Note that it might be best to eventually remove all Array-specific checks (which I think will be redundant) in favor of consolidated checks in ArrayData, but I don't want to consider doing that until the validation checks are complete.
let list_data = ArrayData::builder(list_data_type)
-    .len(3)
+    .len(2)
This is a similar invalid test that the new checks identified -- value_data has only three items in it, so doing offset = 1 and len = 3 can potentially read off the end. This change corrects the len to be within bounds.
There is a very similar (and thus fixed) test in array_list.rs and one in array_map.rs
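The failure mode can be illustrated with a simplified offsets check (a hypothetical helper, not arrow's code): each list offset must be non-negative and non-decreasing, and the final offset must not exceed the child data length.

```rust
// Hypothetical check for list-array offsets against the child values length.
fn list_offsets_valid(offsets: &[i32], child_len: usize) -> bool {
    offsets.first().map_or(true, |&start| start >= 0)
        && offsets.windows(2).all(|w| w[0] <= w[1])
        && offsets.last().map_or(true, |&end| end as usize <= child_len)
}
```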
    .add_child_data(value_data)
    .build()
    .unwrap();
let list_data = unsafe {
This test is checking the ListArray-specific error, but since the new ArrayData validation checks now catch the problem, we need to switch to using build_unchecked() to exercise the same code path.
// https://github.com/apache/arrow-rs/issues/814
// https://github.com/apache/arrow-rs/issues/85
if matches!(&self.data_type, DataType::Union(..)) {
    return Ok(());
I finally settled on leaving UnionArray unvalidated for this PR so that I can backport it to the 6.x release line (it is backwards compatible); in a backward-incompatible PR I will fix up the UnionArray implementation (and validation).
@@ -137,7 +137,10 @@ impl UnionArray {
     }
 }

-        Ok(Self::new(type_ids, value_offsets, child_arrays, bitmap))
+        let new_self = Self::new(type_ids, value_offsets, child_arrays, bitmap);
+        new_self.data().validate()?;
As part of #85 I will clean up the union array validation
@jhorstmann, @paddyhoran, @nevi-me or @jorgecarleitao: might I trouble one of you for a review of this PR? I know it is large, but I think it is important, and it is fairly mechanical in terms of validating the creation of ArrayData.
FYI the MIRI failure is likely the same as #879 -- I'll plan to look at that shortly if no one else gets to it.
    }
}

if self.null_count > self.len {
I'm wondering what could happen if the null_count actually gives a different number than the validity buffer, and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range. I would suggest removing the null_count parameter from try_new to completely avoid inconsistencies.
Kernels that want to avoid the overhead of counting bits could use the unsafe new_unchecked method. I see it is currently set by the cast_array_data function. To make that one safe while avoiding bit counting, we could just validate that the from and to layouts are equal.
> I'm wondering what could happen if the null_count actually gives a different number than the validity buffer and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range.
I suspect (but cannot prove) that if null_count is inconsistent with the validity buffer, then there may be incorrect answers but not undefined behavior (in the Rust sense).
My plan will be:
- In a separate PR (as it will not be backwards compatible), remove the null_count parameter from try_new(): Remove null_count from ArrayData::try_new() #911
- In the PR where I add the full data checking (validate_full()), ensure that the declared null count matches the validity bitmap
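Deriving the null count from the validity bitmap, as suggested, is a straightforward bit count. A sketch (hypothetical helper; LSB-first bit order, as the Arrow format uses for validity bitmaps):

```rust
// Hypothetical helper: count the zero (null) bits among the first
// `len` bits of an LSB-first validity bitmap.
fn compute_null_count(validity: &[u8], len: usize) -> usize {
    (0..len)
        .filter(|&i| validity[i / 8] & (1 << (i % 8)) == 0)
        .count()
}
```

A constructor that always computes this value (or counts set bits byte-wise with `count_ones` for speed) cannot produce a null_count that disagrees with the bitmap.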
arrow/src/datatypes/datatype.rs
Outdated
@@ -477,6 +477,15 @@ impl DataType {
     )
 }

+/// Returns true if this type is integral: (UInt*, Int*).
+pub fn is_integer(t: &DataType) -> bool {
This seems to be only used for validating dictionary key types; maybe rename to is_dictionary_key_type and link the ArrowDictionaryKeyType trait in the comment.
Done in 8acd8d2
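A sketch of the renamed predicate over a simplified, hypothetical DataType enum (the real arrow enum has many more variants; the accepted set mirrors the integer types usable as dictionary keys):

```rust
// Hypothetical, simplified DataType enum for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
enum DataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Utf8, Float64,
}

/// Returns true if `t` is a valid dictionary key type
/// (i.e. one of the signed or unsigned integer types).
fn is_dictionary_key_type(t: &DataType) -> bool {
    matches!(
        t,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
            | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
    )
}
```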
I think this looks good. Might be worthwhile to convert some of the current
At the moment, almost all of the tests in arrow use The remaining uses of What I plan as part of implementing
Thanks @jhorstmann for the review. I think this PR is now ready to go and will plan to merge it early next week unless anyone objects or would like more time to review.
…810)
* Validate arguments to ArrayData::new: null bit buffer and buffers
* Rename is_int_type to is_dictionary_key_type()
* Correctly handle self.offset in offsets buffer
* Consolidate checks
* Fix test output
Which issue does this PR close?
Part of #817
Rationale for this change
This is a step towards improving the security posture of arrow-rs and resolving RUSTSEC vulnerabilities by validating arguments to ArrayData::try_new(). This particular PR adds basic size and offset sanity checking.
I apologize for the massive PR, but the checks and tests are fairly repetitive.
See discussion on #817 for more details
What changes are included in this PR?
Validation checks in ArrayData::try_new(), based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrou

Planned for follow on PRs:
- validate_full: will also check the values of any offsets buffers: Add full data validation for ArrayData::try_new() #921
- UnionArray implementation / validate it properly (Implement Union validity bitmap changes from ARROW-9222 #85)
ArrayData::try_new() may now return an Err on some (invalid) inputs rather than succeeding