Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
…rrow.array.ListArray.fromArrays` (apache#38531) ### Rationale for this change This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to validate whether provided `offsets` and `values` are valid. ### What changes are included in this PR? 1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full".` to the `arrow.array.ListArrays.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArrays.fromArrays` function. 2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. Hopefully, this should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays. 3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function. Supported values for `arrow.array.ValidationMode` include: * `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`. * `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood. * `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood. **Example** ```matlab >> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5])) offsets = Int32Array with 6 elements and 0 null values: 0 | 1 | 2 | 3 | 4 | 5 >> values = arrow.array([1, 2, 3]) values = Float64Array with 3 elements and 0 null values: 1 | 2 | 3 >> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full") Error using . Offset invariant failure: offset for slot 4 out of bounds: 4 > 3 Error in arrow.array.ListArray.fromArrays (line 108) proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode))); >> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal") Error using . Length spanned by list offsets (5) larger than values array (length 3) Error in arrow.array.ListArray.fromArrays (line 108) proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode))); >> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none") array = ListArray with 5 elements and 0 null values: <Invalid array: Length spanned by list offsets (5) larger than values array (length 3)> ``` ### Are these changes tested? Yes. 1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`. ### Are there any user-facing changes? Yes. 1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair. 2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`. 3. Client code can now create `arrow.array.ValidationMode` enumeration values. **This PR includes breaking changes to public APIs.** Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function. However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`. ### Future Directions 1. Currently `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (apache#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error prone than creating simpler `Array` types like `Float64Array` or `StringArray`. ### Notes 1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future. 2. Thank you @ sgilmore10 for your help with this pull request! * Closes: apache#38361 Authored-by: Kevin Gurney <[email protected]> Signed-off-by: Kevin Gurney <[email protected]>
- Loading branch information