GH-38361: [MATLAB] Add validation logic for `offsets` and `values` to `arrow.array.ListArray.fromArrays` #38531
2. Hard-code enumeration values for `ValidationMode` to match in both MATLAB and C++.
3. Set default `ValidationMode` to `Minimal` for `arrow.array.ListArray.fromArrays`.
4. Move `arrow.array.internal.validation.ValidationMode` to `arrow.array.ValidationMode`.
are valid, no error is thrown, even if `ValidationMode` is set to `"Minimal"` or `"Full"`.
Apologies for prematurely leaving comments on this PR. I did not realize it was still a draft.

No worries at all! Thanks for the helpful feedback!

+1

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 66844e9. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.
…rrow.array.ListArray.fromArrays` (apache#38531)

### Rationale for this change

This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to check whether the provided `offsets` and `values` are valid.

### What changes are included in this PR?

1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full"` to the `arrow.array.ListArray.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArray.fromArrays` function.
2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. This should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays.
3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function.

Supported values for `arrow.array.ValidationMode` include:

* `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`.
* `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood.
* `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood.
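To make the difference between the three modes more concrete, here is a minimal Python sketch of the kind of checks each mode might run on a list array's `offsets`. It is an illustration only, not Arrow's implementation: the function name `validate_list_layout`, the integer enumeration values, and the ordering of the checks are all assumptions. The error wording is modeled on the messages shown in this PR's example.

```python
from enum import IntEnum


class ValidationMode(IntEnum):
    # Integer values are assumed here purely for illustration.
    NONE = 0
    MINIMAL = 1
    FULL = 2


def validate_list_layout(offsets, num_values, mode=ValidationMode.MINIMAL):
    """Raise ValueError if `offsets` cannot describe lists over `num_values` items."""
    if mode == ValidationMode.NONE:
        return  # Skip every check; the caller accepts the risk.

    if len(offsets) == 0:
        raise ValueError("offsets must contain at least one element")
    if offsets[0] < 0:
        raise ValueError(f"first offset is negative: {offsets[0]}")

    if mode == ValidationMode.FULL:
        # Expensive part: visit every offset (linear in the number of offsets).
        previous = offsets[0]
        for slot, offset in enumerate(offsets):
            if offset < previous:
                raise ValueError(
                    f"Offset invariant failure: non-monotonic offset at slot "
                    f"{slot}: {offset} < {previous}"
                )
            if offset > num_values:
                raise ValueError(
                    f"Offset invariant failure: offset for slot {slot} "
                    f"out of bounds: {offset} > {num_values}"
                )
            previous = offset

    # Inexpensive part: only the first and last offsets are inspected.
    spanned = offsets[-1] - offsets[0]
    if spanned > num_values:
        raise ValueError(
            f"Length spanned by list offsets ({spanned}) "
            f"larger than values array (length {num_values})"
        )
```

In this sketch, offsets such as `[0, 3, 1, 3]` over three values pass the minimal check, because the spanned length is within bounds, but fail the full check, because the offsets are not monotonically non-decreasing. That is the kind of trade-off between cost and robustness that the `Minimal` and `Full` modes describe.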
**Example**

```matlab
>> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]))

offsets =

  Int32Array with 6 elements and 0 null values:

    0 | 1 | 2 | 3 | 4 | 5

>> values = arrow.array([1, 2, 3])

values =

  Float64Array with 3 elements and 0 null values:

    1 | 2 | 3

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full")
Error using .
Offset invariant failure: offset for slot 4 out of bounds: 4 > 3

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal")
Error using .
Length spanned by list offsets (5) larger than values array (length 3)

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none")

array =

  ListArray with 5 elements and 0 null values:

    <Invalid array: Length spanned by list offsets (5) larger than values array (length 3)>
```

### Are these changes tested?

Yes.

1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`.

### Are there any user-facing changes?

Yes.

1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair.
2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`.
3. Client code can now create `arrow.array.ValidationMode` enumeration values.

**This PR includes breaking changes to public APIs.**

Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function.
However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`.

### Future Directions

1. Currently `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (apache#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error-prone than creating simpler `Array` types like `Float64Array` or `StringArray`.

### Notes

1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future.
2. Thank you @sgilmore10 for your help with this pull request!

* Closes: apache#38361

Authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
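The breaking change and its workaround, as described in the user-facing changes, can be sketched in Python as follows. This is a hypothetical stand-in rather than the MATLAB or C++ code: `parse_mode` and `from_arrays` are invented names, the lists are plain Python lists, and only the inexpensive span check is modeled, so `"full"` behaves like `"minimal"` here. Mode names are matched case-insensitively, mirroring the example, which passes lowercase names.

```python
VALID_MODES = ("none", "minimal", "full")


def parse_mode(name):
    """Normalize a mode name; hypothetical helper, case-insensitive by assumption."""
    key = name.lower()
    if key not in VALID_MODES:
        raise ValueError(f"unknown ValidationMode: {name!r}")
    return key


def from_arrays(offsets, values, validation_mode="Minimal"):
    """Build a list of sub-lists from offsets and values (illustrative stand-in)."""
    mode = parse_mode(validation_mode)
    if mode != "none":
        # Default behavior after this change: reject offsets that span
        # more elements than the values array holds.
        spanned = offsets[-1] - offsets[0]
        if spanned > len(values):
            raise ValueError(
                f"Length spanned by list offsets ({spanned}) "
                f"larger than values array (length {len(values)})"
            )
    # Each consecutive pair of offsets delimits one list element.
    return [values[start:stop] for start, stop in zip(offsets, offsets[1:])]
```

Calling `from_arrays([0, 1, 2, 3, 4, 5], [1.0, 2.0, 3.0])` raises in this sketch, while passing `validation_mode="none"` restores the earlier accept-everything behavior and returns a five-element result whose contents are not guaranteed to be meaningful.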