From 9fae94187bf22f4c9e249e1952694dd0fd735a4b Mon Sep 17 00:00:00 2001 From: Kevin Gurney Date: Tue, 31 Oct 2023 14:16:07 -0400 Subject: [PATCH] GH-38361: Add validation logic for `offsets` and `values` to `arrow.array.ListArray.fromArrays` (#38531) ### Rationale for this change This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to validate whether provided `offsets` and `values` are valid. ### What changes are included in this PR? 1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full".` to the `arrow.array.ListArrays.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArrays.fromArrays` function. 2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. Hopefully, this should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays. 3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function. Supported values for `arrow.array.ValidationMode` include: * `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`. * `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood. * `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood. **Example** ```matlab >> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5])) offsets = Int32Array with 6 elements and 0 null values: 0 | 1 | 2 | 3 | 4 | 5 >> values = arrow.array([1, 2, 3]) values = Float64Array with 3 elements and 0 null values: 1 | 2 | 3 >> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full") Error using . Offset invariant failure: offset for slot 4 out of bounds: 4 > 3 Error in arrow.array.ListArray.fromArrays (line 108) proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode))); >> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal") Error using . Length spanned by list offsets (5) larger than values array (length 3) Error in arrow.array.ListArray.fromArrays (line 108) proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode))); >> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none") array = ListArray with 5 elements and 0 null values: ``` ### Are these changes tested? Yes. 1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`. ### Are there any user-facing changes? Yes. 1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair. 2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`. 3. Client code can now create `arrow.array.ValidationMode` enumeration values. **This PR includes breaking changes to public APIs.** Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function. However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`. ### Future Directions 1. Currently `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error prone than creating simpler `Array` types like `Float64Array` or `StringArray`. ### Notes 1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future. 2. Thank you @ sgilmore10 for your help with this pull request! * Closes: #38361 Authored-by: Kevin Gurney Signed-off-by: Kevin Gurney --- .../arrow/matlab/array/proxy/list_array.cc | 36 ++++++ .../cpp/arrow/matlab/array/proxy/list_array.h | 1 + .../cpp/arrow/matlab/array/validation_mode.h | 30 +++++ matlab/src/cpp/arrow/matlab/error/error.h | 3 + matlab/src/matlab/+arrow/+array/ListArray.m | 7 +- .../src/matlab/+arrow/+array/ValidationMode.m | 24 ++++ matlab/test/arrow/array/tListArray.m | 105 ++++++++++++++++++ 7 files changed, 204 insertions(+), 2 deletions(-) create mode 100644 matlab/src/cpp/arrow/matlab/array/validation_mode.h create mode 100644 matlab/src/matlab/+arrow/+array/ValidationMode.m diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/list_array.cc b/matlab/src/cpp/arrow/matlab/array/proxy/list_array.cc index fc75e55dd6012..941e658c25127 100644 --- a/matlab/src/cpp/arrow/matlab/array/proxy/list_array.cc +++ b/matlab/src/cpp/arrow/matlab/array/proxy/list_array.cc @@ -15,6 +15,7 @@ // specific language governing permissions and limitations // under the License. +#include "arrow/matlab/array/validation_mode.h" #include "arrow/matlab/array/proxy/list_array.h" #include "arrow/matlab/array/proxy/numeric_array.h" #include "arrow/matlab/array/proxy/wrap.h" @@ -26,6 +27,7 @@ namespace arrow::matlab::array::proxy { ListArray::ListArray(std::shared_ptr list_array) : proxy::Array{std::move(list_array)} { REGISTER_METHOD(ListArray, getValues); REGISTER_METHOD(ListArray, getOffsets); + REGISTER_METHOD(ListArray, validate); } libmexclass::proxy::MakeResult ListArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) { @@ -100,4 +102,38 @@ namespace arrow::matlab::array::proxy { mda::ArrayFactory factory; context.outputs[0] = factory.createScalar(offsets_int32_array_proxy_id); } + + void ListArray::validate(libmexclass::proxy::method::Context& context) { + namespace mda = ::matlab::data; + mda::StructArray args = context.inputs[0]; + const mda::TypedArray validation_mode_mda = args[0]["ValidationMode"]; + const auto validation_mode_integer = uint8_t(validation_mode_mda[0]); + // Convert integer representation to ValidationMode enum. + const auto validation_mode = static_cast(validation_mode_integer); + switch (validation_mode) { + case ValidationMode::None: { + // Do nothing. + break; + } + case ValidationMode::Minimal: { + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(array->Validate(), + context, + error::ARRAY_VALIDATE_MINIMAL_FAILED); + break; + } + case ValidationMode::Full: { + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(array->ValidateFull(), + context, + error::ARRAY_VALIDATE_FULL_FAILED); + break; + } + default: { + // Throw an error if an unsupported enumeration value is provided. + const auto msg = "Unsupported ValidationMode enumeration value: " + std::to_string(validation_mode_integer); + context.error = libmexclass::error::Error{error::ARRAY_VALIDATE_UNSUPPORTED_ENUM, msg}; + return; + } + } + } + } diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/list_array.h b/matlab/src/cpp/arrow/matlab/array/proxy/list_array.h index 8db6b6bf1d632..1f34b11406594 100644 --- a/matlab/src/cpp/arrow/matlab/array/proxy/list_array.h +++ b/matlab/src/cpp/arrow/matlab/array/proxy/list_array.h @@ -32,6 +32,7 @@ class ListArray : public arrow::matlab::array::proxy::Array { protected: void getValues(libmexclass::proxy::method::Context& context); void getOffsets(libmexclass::proxy::method::Context& context); + void validate(libmexclass::proxy::method::Context& context); }; diff --git a/matlab/src/cpp/arrow/matlab/array/validation_mode.h b/matlab/src/cpp/arrow/matlab/array/validation_mode.h new file mode 100644 index 0000000000000..92e10f47aa4e7 --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/array/validation_mode.h @@ -0,0 +1,30 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include + +namespace arrow::matlab::array { + + enum class ValidationMode : uint8_t { + None = 0, + Minimal = 1, + Full = 2 + }; + +} diff --git a/matlab/src/cpp/arrow/matlab/error/error.h b/matlab/src/cpp/arrow/matlab/error/error.h index 33e80bca8cc5a..5aa8f05c8c315 100644 --- a/matlab/src/cpp/arrow/matlab/error/error.h +++ b/matlab/src/cpp/arrow/matlab/error/error.h @@ -203,4 +203,7 @@ namespace arrow::matlab::error { static const char* BUFFER_VIEW_OR_COPY_FAILED = "arrow:buffer:ViewOrCopyFailed"; static const char* ARRAY_PRETTY_PRINT_FAILED = "arrow:array:PrettyPrintFailed"; static const char* TABULAR_GET_ROW_AS_STRING_FAILED = "arrow:tabular:GetRowAsStringFailed"; + static const char* ARRAY_VALIDATE_MINIMAL_FAILED = "arrow:array:ValidateMinimalFailed"; + static const char* ARRAY_VALIDATE_FULL_FAILED = "arrow:array:ValidateFullFailed"; + static const char* ARRAY_VALIDATE_UNSUPPORTED_ENUM = "arrow:array:ValidateUnsupportedEnum"; } diff --git a/matlab/src/matlab/+arrow/+array/ListArray.m b/matlab/src/matlab/+arrow/+array/ListArray.m index f8fd934b7c448..9a93c39ec6193 100644 --- a/matlab/src/matlab/+arrow/+array/ListArray.m +++ b/matlab/src/matlab/+arrow/+array/ListArray.m @@ -79,8 +79,9 @@ offsets (1, 1) arrow.array.Int32Array values (1, 1) arrow.array.Array opts.Valid + opts.ValidationMode (1, 1) arrow.array.ValidationMode = arrow.array.ValidationMode.Minimal end - + import arrow.internal.validate.parseValid if nargin < 2 @@ -100,9 +101,11 @@ ValuesProxyID=valuesProxyID, ... Valid=validElements ... ); - + proxyName = "arrow.array.proxy.ListArray"; proxy = arrow.internal.proxy.create(proxyName, args); + % Validate the provided offsets and values. + proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode))); array = arrow.array.ListArray(proxy); end diff --git a/matlab/src/matlab/+arrow/+array/ValidationMode.m b/matlab/src/matlab/+arrow/+array/ValidationMode.m new file mode 100644 index 0000000000000..3442bcccf725b --- /dev/null +++ b/matlab/src/matlab/+arrow/+array/ValidationMode.m @@ -0,0 +1,24 @@ +% Mode to use for Array validation. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +classdef ValidationMode < uint8 + enumeration + None (0) + Minimal (1) + Full (2) + end +end \ No newline at end of file diff --git a/matlab/test/arrow/array/tListArray.m b/matlab/test/arrow/array/tListArray.m index 1ebf66e2f0999..07304eb384299 100644 --- a/matlab/test/arrow/array/tListArray.m +++ b/matlab/test/arrow/array/tListArray.m @@ -23,6 +23,7 @@ properties (TestParameter) TestArrowArray + TestValidationModeArray end methods (TestParameterDefinition, Static) @@ -113,6 +114,29 @@ ); end + function TestValidationModeArray = initializeTestValidationModeArray() + %% Valid ListArray + Offsets = arrow.array(int32([0, 1, 2, 3])); + Values = arrow.array([1, 2, 3]); + + TestValidationModeArray.ValidList = struct( ... + Offsets=Offsets, ... + Values=Values, ... + Valid=true ... + ); + + %% Invalid ListArray + % Incorrect number of offsets (length should be 1 more than the number of Values). + Offsets = arrow.array(int32([0, 1, 2, 3, 4, 5])); + Values = arrow.array([1, 2, 3]); + + TestValidationModeArray.InvalidList = struct( ... + Offsets=Offsets, ... + Values=Values, ... + Valid=false ... + ); + end + end methods (Test) @@ -160,6 +184,87 @@ function TestErrorIfEmptyOffsets(testCase) testCase.verifyError(fcn, "arrow:array:ListArrayFromArraysFailed"); end + function TestValidationModeDefault(testCase, TestValidationModeArray) + % Verify that the default ValidationMode value for the + % arrow.array.ListArray.fromArrays method is + % arrow.array.ValidationMode.Minimal. + offsets = TestValidationModeArray.Offsets; + values = TestValidationModeArray.Values; + valid = TestValidationModeArray.Valid; + fcn = @() arrow.array.ListArray.fromArrays(offsets, values); + if valid + testCase.verifyWarningFree(fcn); + else + testCase.verifyError(fcn, "arrow:array:ValidateMinimalFailed"); + end + end + + function TestValidationModeNone(testCase, TestValidationModeArray) + % Verify that no error is thrown when supplying the + % ValidatationMode name-value pair, with a value of + % arrow.array.ValidationMode.None, to the + % arrow.array.ListArray.fromArrays method. + offsets = TestValidationModeArray.Offsets; + values = TestValidationModeArray.Values; + validationMode = arrow.array.ValidationMode.None; + fcn = @() arrow.array.ListArray.fromArrays(offsets, values, ValidationMode=validationMode); + testCase.verifyWarningFree(fcn); + end + + function TestValidationModeMinimal(testCase, TestValidationModeArray) + % Verify that an error of type arrow:array:ValidateMinimalFailed + % is thrown when supplying the ValidatationMode name-value pair, + % with a value of arrow.array.ValidationMode.Minimal, to the + % arrow.array.ListArray.fromArrays method, if the provided offsets + % and values arrays are invalid. + offsets = TestValidationModeArray.Offsets; + values = TestValidationModeArray.Values; + valid = TestValidationModeArray.Valid; + validationMode = arrow.array.ValidationMode.Minimal; + fcn = @() arrow.array.ListArray.fromArrays(offsets, values, ValidationMode=validationMode); + if valid + testCase.verifyWarningFree(fcn); + else + testCase.verifyError(fcn, "arrow:array:ValidateMinimalFailed"); + end + end + + function TestValidationModeFull(testCase, TestValidationModeArray) + % Verify that an error of type arrow:array:ValidateFullFailed + % is thrown when supplying the ValidatationMode name-value pair, + % with a value of arrow.array.ValidationMode.Full, to the + % arrow.array.ListArray.fromArrays method, if the provided offsets + % and values arrays are invalid. + offsets = TestValidationModeArray.Offsets; + values = TestValidationModeArray.Values; + validationMode = arrow.array.ValidationMode.Full; + valid = TestValidationModeArray.Valid; + fcn = @() arrow.array.ListArray.fromArrays(offsets, values, ValidationMode=validationMode); + if valid + testCase.verifyWarningFree(fcn); + else + testCase.verifyError(fcn, "arrow:array:ValidateFullFailed"); + end + end + + function TestValidationModeUnsupportedEnum(testCase) + % Verify that an error of type arrow:array:ValidateUnsupportedEnum + % is thrown when an unsupported integer enumeration value is + % supplied for the ValidatationMode parameter to the internal + % C++ ListArray Proxy validate method. + offsets = arrow.array.Int32Array.fromMATLAB(int32([0, 1, 2])); + values = arrow.array.Float64Array.fromMATLAB([1, 2, 3]); + array = arrow.array.ListArray.fromArrays(offsets, values); + % Get the underlying Proxy instance from the ListArray. + proxy = array.Proxy; + % Call the internal Proxy method "validate" with an unsupported + % integer ValidationMode value. + validationMode = uint8(3); + args = struct(ValidationMode=validationMode); + fcn = @() proxy.validate(args); + testCase.verifyError(fcn, "arrow:array:ValidateUnsupportedEnum"); + end + end end