[MATLAB] Add validation logic for `offsets` and `values` to `arrow.array.ListArray.fromArrays` #38361
Comments
kevingurney added a commit that referenced this issue on Oct 23, 2023
### Rationale for this change

Now that many of the commonly used "primitive" array types have been added to the MATLAB interface, we can implement an `arrow.array.ListArray` class.

This pull request adds a new `arrow.array.ListArray` class which can be converted to a MATLAB `cell` array by calling the `toMATLAB` method.

### What changes are included in this PR?

1. Added a new `arrow.array.ListArray` MATLAB class.

   *Methods*

   `cellArray = arrow.array.ListArray.toMATLAB()`
   `listArray = arrow.array.ListArray.fromArrays(offsets, values)`

   *Properties*

   `Offsets` - `Int32Array` list offsets (uses zero-based indexing)
   `Values` - Array of values in the list (supports nesting)

2. Added a new `arrow.type.traits.ListTraits` MATLAB class.

**Example**

```matlab
>> offsets = arrow.array(int32([0, 2, 3, 7]))

offsets =

[
  0,
  2,
  3,
  7
]

>> values = arrow.array(["A", "B", "C", "D", "E", "F", "G"])

values =

[
  "A",
  "B",
  "C",
  "D",
  "E",
  "F",
  "G"
]

>> arrowArray = arrow.array.ListArray.fromArrays(offsets, values)

arrowArray =

[
  [
    "A",
    "B"
  ],
  [
    "C"
  ],
  [
    "D",
    "E",
    "F",
    "G"
  ]
]

>> matlabArray = arrowArray.toMATLAB()

matlabArray =

  3x1 cell array

    {2x1 string}
    {["C"     ]}
    {4x1 string}

>> matlabArray{:}

ans =

  2x1 string array

    "A"
    "B"

ans =

    "C"

ans =

  4x1 string array

    "D"
    "E"
    "F"
    "G"
```

### Are these changes tested?

Yes.

1. Added a new `tListArray.m` test class.
2. Added a new `tListTraits.m` test class.
3. Updated `arrow.internal.test.tabular.createAllSupportedArrayTypes` to include `ListArray`.

### Are there any user-facing changes?

Yes.

1. Users can now create an `arrow.array.ListArray` from an `offsets` and `values` array by calling the static `arrow.array.ListArray.fromArrays(offsets, values)` method. `ListArray`s can be converted into MATLAB `cell` arrays by calling the `toMATLAB` method.

### Notes

1. We chose to use the "missing-class" `missing` value as the `NullSubstitutionValue` for the time being for `ListArray`. However, we eventually want to add `arrow.array.NullArray`, and will most likely want to use the "missing-class" `missing` value to represent `NullArray` values in MATLAB. So, this could cause some ambiguity in the future. We have been thinking about whether we should introduce some sort of special "sentinel value" to represent null values when converting to MATLAB `cell` arrays, perhaps something like `arrow.Null`, in order to avoid this ambiguity. If we think it makes sense to do that, we may want to retroactively change the `NullSubstitutionValue` to be `arrow.Null` and break compatibility. Since we are still in pre-`0.1`, we don't think the impact of such a behavior change would be very large.
2. Implementing `ListArray` is fairly involved. So, in the spirit of incremental delivery, we chose not to include an implementation of `arrow.array.ListArray.fromMATLAB` in this initial pull request. We plan on following up with some more changes to `arrow.array.ListArray`. See #38353, #38354, and #38361.
3. Thank you @sgilmore10 for your help with this pull request!

### Future Directions

1. #38353
2. #38354
3. #38361
4. Consider adding a null sentinel value like `arrow.Null` for conversion to MATLAB `cell` arrays.

* Closes: #37815

Lead-authored-by: Kevin Gurney <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
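To make the relationship between `offsets` and `values` in the example above concrete, here is a minimal sketch in plain Python (not MATLAB, so that it runs without the Arrow MATLAB interface). The helper name `lists_from_offsets` is hypothetical and is not part of any Arrow API; it only illustrates that list `i` spans the values from `offsets[i]` up to, but not including, `offsets[i + 1]`, using zero-based indexing.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def lists_from_offsets(offsets: Sequence[int], values: Sequence[T]) -> List[List[T]]:
    """Slice a flat values sequence into lists using zero-based offsets.

    Hypothetical helper for illustration only. List i is
    values[offsets[i]:offsets[i + 1]], so N lists require N + 1 offsets.
    """
    # Pair each offset with its successor to get the [start, stop) range of each list.
    return [list(values[start:stop]) for start, stop in zip(offsets, offsets[1:])]


offsets = [0, 2, 3, 7]
values = ["A", "B", "C", "D", "E", "F", "G"]
print(lists_from_offsets(offsets, values))
# [['A', 'B'], ['C'], ['D', 'E', 'F', 'G']]
```

The four offsets describe three lists, which matches the `3x1 cell array` produced by `toMATLAB` in the example.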
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue on Oct 25, 2023
…ache#38357)
kevingurney added a commit that referenced this issue on Oct 31, 2023
…rray.ListArray.fromArrays` (#38531)

### Rationale for this change

This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to validate whether provided `offsets` and `values` are valid.

### What changes are included in this PR?

1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full"` to the `arrow.array.ListArray.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArray.fromArrays` function.
2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. Hopefully, this should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays.
3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function.

Supported values for `arrow.array.ValidationMode` include:

* `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`.
* `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood.
* `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood.

**Example**

```matlab
>> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]))

offsets =

  Int32Array with 6 elements and 0 null values:

    0 | 1 | 2 | 3 | 4 | 5

>> values = arrow.array([1, 2, 3])

values =

  Float64Array with 3 elements and 0 null values:

    1 | 2 | 3

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full")
Error using .
Offset invariant failure: offset for slot 4 out of bounds: 4 > 3

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal")
Error using .
Length spanned by list offsets (5) larger than values array (length 3)

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none")

array =

  ListArray with 5 elements and 0 null values:

    <Invalid array: Length spanned by list offsets (5) larger than values array (length 3)>
```

### Are these changes tested?

Yes.

1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`.

### Are there any user-facing changes?

Yes.

1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair.
2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`.
3. Client code can now create `arrow.array.ValidationMode` enumeration values.

**This PR includes breaking changes to public APIs.**

Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function. However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`.

### Future Directions

1. Currently, `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error prone than creating simpler `Array` types like `Float64Array` or `StringArray`.

### Notes

1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future.
2. Thank you @sgilmore10 for your help with this pull request!

* Closes: #38361

Authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
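The difference between the three modes can be illustrated with a rough sketch in plain Python. This is not how Arrow implements `Array::Validate` or `Array::ValidateFull`; the function `validate_offsets`, its parameters, and the exact set of checks are assumptions chosen to mirror the two error messages shown in the example above. The idea is that a "minimal" check only looks at the first and last offsets (constant cost), while a "full" check walks every offset (cost proportional to the number of lists).

```python
from typing import Sequence


def validate_offsets(offsets: Sequence[int], num_values: int, mode: str = "minimal") -> None:
    """Raise ValueError if offsets cannot describe lists over num_values values.

    Hypothetical sketch: "none" skips all checks, "minimal" inspects only the
    endpoints, "full" inspects every offset.
    """
    mode = mode.lower()
    if mode not in ("none", "minimal", "full"):
        raise ValueError(f"unknown validation mode: {mode}")
    if mode == "none":
        return
    if len(offsets) == 0:
        raise ValueError("offsets must contain at least one element")

    if mode == "minimal":
        # Cheap check: compare the total span of the offsets with the values length.
        if offsets[0] < 0:
            raise ValueError(f"first offset is negative: {offsets[0]}")
        span = offsets[-1] - offsets[0]
        if span > num_values:
            raise ValueError(
                f"Length spanned by list offsets ({span}) "
                f"larger than values array (length {num_values})"
            )
        return

    # Full check: every offset must be in bounds and offsets must not decrease.
    for slot, offset in enumerate(offsets):
        if offset < 0 or offset > num_values:
            raise ValueError(
                f"offset for slot {slot} out of bounds: {offset} > {num_values}"
                if offset > num_values
                else f"offset for slot {slot} is negative: {offset}"
            )
        if slot > 0 and offset < offsets[slot - 1]:
            raise ValueError(f"offset for slot {slot} is smaller than the previous offset")


# Same inputs as the MATLAB example: six offsets, but only three values.
for mode in ("none", "minimal", "full"):
    try:
        validate_offsets([0, 1, 2, 3, 4, 5], 3, mode)
        print(mode, "-> accepted")
    except ValueError as err:
        print(mode, "->", err)
# none -> accepted
# minimal -> Length spanned by list offsets (5) larger than values array (length 3)
# full -> offset for slot 4 out of bounds: 4 > 3
```

Note that an endpoint-only check can miss problems in the middle of the offsets (for example, offsets that decrease and then recover), which is one reason a separate, more expensive mode is useful.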
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue on Nov 13, 2023
…ache#38357)
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue on Nov 13, 2023
…rrow.array.ListArray.fromArrays` (apache#38531)
dgreiss pushed a commit to dgreiss/arrow that referenced this issue on Feb 19, 2024
…ache#38357)
dgreiss pushed a commit to dgreiss/arrow that referenced this issue on Feb 19, 2024
…rrow.array.ListArray.fromArrays` (apache#38531)
### Describe the enhancement requested

Currently, `arrow.array.ListArray.fromArrays` does not validate the supplied `offsets` and `values` to ensure that they represent a valid list. This can result in `ListArray`s that look like the following:

We should add support for validating `offsets` and `values` to `arrow.array.ListArray.fromArrays` and potentially provide a name-value pair like `ValidateList = true | false` to the `fromArrays` method to let client code control whether or not they want to pay the cost of validation.

### Component(s)

MATLAB