Skip to content

Commit

Permalink
apacheGH-38361: Add validation logic for offsets and values to `a…
Browse files Browse the repository at this point in the history
…rrow.array.ListArray.fromArrays` (apache#38531)

### Rationale for this change

This pull request adds a new `ValidationMode` name-value pair to the `arrow.array.ListArray.fromArrays` function. This allows client code to validate whether provided `offsets` and `values` are valid.

### What changes are included in this PR?

1. Added a new name-value pair `ValidationMode = "None" | "Minimal" (default) | "Full".` to the `arrow.array.ListArrays.fromArrays` function. If `ValidationMode` is set to `"Minimal"` or `"Full"` and the provided `offsets` and `values` arrays are invalid, then an error will be thrown when calling the `arrow.array.ListArrays.fromArrays` function.
2. Set the default `ValidationMode` for `arrow.array.ListArray.fromArrays` to `"Minimal"` to balance usability and performance when creating `ListArray`s. Hopefully, this should help more MATLAB users navigate the complexities of creating `ListArray`s "from scratch" using `offsets` and `values` arrays.
3. Added a new `arrow.array.ValidationMode` enumeration class. This is used as the type of the `ValidationMode` name-value pair on the `arrow.array.ListArray.fromArrays` function.

Supported values for `arrow.array.ValidationMode` include:
* `arrow.array.ValidationMode.None` - Do no validation checks on the given `Array`.
* `arrow.array.ValidationMode.Minimal` - Do relatively inexpensive validation checks on the given `Array`. Delegates to the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood.
* `arrow.array.ValidationMode.Full` - Do expensive / robust validation checks on the given `Array`. Delegates to the C++ [`Array::ValidateFull`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L209) method under the hood.

**Example**
```matlab
>> offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]))

offsets = 

  Int32Array with 6 elements and 0 null values:

    0 | 1 | 2 | 3 | 4 | 5

>> values = arrow.array([1, 2, 3])

values = 

  Float64Array with 3 elements and 0 null values:

    1 | 2 | 3

>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="full")
Error using  . 
Offset invariant failure: offset for slot 4 out of bounds: 4 > 3

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="minimal")
Error using  . 
Length spanned by list offsets (5) larger than values array (length 3)

Error in arrow.array.ListArray.fromArrays (line 108)
            proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
 
>> array = arrow.array.ListArray.fromArrays(offsets, values, ValidationMode="none")

array = 

  ListArray with 5 elements and 0 null values:

<Invalid array: Length spanned by list offsets (5) larger than values array (length 3)>
```

### Are these changes tested?

Yes.

1. Added new test cases for verifying `ValidationMode` behavior to `tListArray.m`.

### Are there any user-facing changes?

Yes.

1. Client code can now control validation behavior when calling `arrow.array.ListArray.fromArrays` by using the new `ValidationMode` name-value pair.
2. By default, an error will now be thrown by `arrow.array.ListArray.fromArrays` for certain invalid combinations of `offsets` and `values`. In other words, `arrow.array.ListArray.fromArrays` will call the C++ method `Array::Validate` by default, which corresponds to `arrow.array.ValidationMode.Minimal`.
3. Client code can now create `arrow.array.ValidationMode` enumeration values.

**This PR includes breaking changes to public APIs.**

Previously, all `offsets` and `values` would be accepted by the `arrow.array.ListArray.fromArrays` function. However, this pull request changes the default behavior to call the C++ [`Array::Validate`](https://github.com/apache/arrow/blob/efd945d437a8df12b200c1da20573216f2a17feb/cpp/src/arrow/array/array_base.h#L200) method under the hood, which means that some previously accepted `offsets` and `values` will now result in a validation error. This can be worked around by setting `ValidationMode` to `"None"` when calling `arrow.array.ListArray.fromArrays`.

### Future Directions

1. Currently `ValidationMode` has only been added to the `arrow.array.ListArray.fromArrays` method. However, in the future, it may make sense to generalize validation behavior and provide `ValidationMode` on other `fromMATLAB` and `fromArrays` methods for other `Array` types. We may also want to add a stand-alone `validate` method on all `arrow.array.Array` classes (apache#38532). We decided to start with `ListArray` as an incremental first step since we suspect creating valid `ListArray`s from `offsets` and `values` will generally be more error prone than creating simpler `Array` types like `Float64Array` or `StringArray`.

### Notes

1. We chose to set the default `ValidationMode` value to `arrow.array.ValidationMode.Minimal` to balance usability and performance. If this ends up causing major performance issues in common workflows, then we could consider changing this to `arrow.array.ValidationMode.None` in the future. 
2. Thank you @ sgilmore10 for your help with this pull request!
* Closes: apache#38361 

Authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
  • Loading branch information
kevingurney authored and dgreiss committed Feb 17, 2024
1 parent 2afcefe commit 18931f1
Show file tree
Hide file tree
Showing 7 changed files with 204 additions and 2 deletions.
36 changes: 36 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/list_array.cc
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
// specific language governing permissions and limitations
// under the License.

#include "arrow/matlab/array/validation_mode.h"
#include "arrow/matlab/array/proxy/list_array.h"
#include "arrow/matlab/array/proxy/numeric_array.h"
#include "arrow/matlab/array/proxy/wrap.h"
Expand All @@ -26,6 +27,7 @@ namespace arrow::matlab::array::proxy {
ListArray::ListArray(std::shared_ptr<arrow::ListArray> list_array) : proxy::Array{std::move(list_array)} {
REGISTER_METHOD(ListArray, getValues);
REGISTER_METHOD(ListArray, getOffsets);
REGISTER_METHOD(ListArray, validate);
}

libmexclass::proxy::MakeResult ListArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
Expand Down Expand Up @@ -100,4 +102,38 @@ namespace arrow::matlab::array::proxy {
mda::ArrayFactory factory;
context.outputs[0] = factory.createScalar(offsets_int32_array_proxy_id);
}

void ListArray::validate(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
mda::StructArray args = context.inputs[0];
const mda::TypedArray<std::uint8_t> validation_mode_mda = args[0]["ValidationMode"];
const auto validation_mode_integer = uint8_t(validation_mode_mda[0]);
// Convert integer representation to ValidationMode enum.
const auto validation_mode = static_cast<ValidationMode>(validation_mode_integer);
switch (validation_mode) {
case ValidationMode::None: {
// Do nothing.
break;
}
case ValidationMode::Minimal: {
MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(array->Validate(),
context,
error::ARRAY_VALIDATE_MINIMAL_FAILED);
break;
}
case ValidationMode::Full: {
MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(array->ValidateFull(),
context,
error::ARRAY_VALIDATE_FULL_FAILED);
break;
}
default: {
// Throw an error if an unsupported enumeration value is provided.
const auto msg = "Unsupported ValidationMode enumeration value: " + std::to_string(validation_mode_integer);
context.error = libmexclass::error::Error{error::ARRAY_VALIDATE_UNSUPPORTED_ENUM, msg};
return;
}
}
}

}
1 change: 1 addition & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/list_array.h
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ class ListArray : public arrow::matlab::array::proxy::Array {
protected:
void getValues(libmexclass::proxy::method::Context& context);
void getOffsets(libmexclass::proxy::method::Context& context);
void validate(libmexclass::proxy::method::Context& context);

};

Expand Down
30 changes: 30 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/validation_mode.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <cstdint>

namespace arrow::matlab::array {

enum class ValidationMode : uint8_t {
None = 0,
Minimal = 1,
Full = 2
};

}
3 changes: 3 additions & 0 deletions matlab/src/cpp/arrow/matlab/error/error.h
Original file line number Diff line number Diff line change
Expand Up @@ -203,4 +203,7 @@ namespace arrow::matlab::error {
static const char* BUFFER_VIEW_OR_COPY_FAILED = "arrow:buffer:ViewOrCopyFailed";
static const char* ARRAY_PRETTY_PRINT_FAILED = "arrow:array:PrettyPrintFailed";
static const char* TABULAR_GET_ROW_AS_STRING_FAILED = "arrow:tabular:GetRowAsStringFailed";
static const char* ARRAY_VALIDATE_MINIMAL_FAILED = "arrow:array:ValidateMinimalFailed";
static const char* ARRAY_VALIDATE_FULL_FAILED = "arrow:array:ValidateFullFailed";
static const char* ARRAY_VALIDATE_UNSUPPORTED_ENUM = "arrow:array:ValidateUnsupportedEnum";
}
7 changes: 5 additions & 2 deletions matlab/src/matlab/+arrow/+array/ListArray.m
Original file line number Diff line number Diff line change
Expand Up @@ -79,8 +79,9 @@
offsets (1, 1) arrow.array.Int32Array
values (1, 1) arrow.array.Array
opts.Valid
opts.ValidationMode (1, 1) arrow.array.ValidationMode = arrow.array.ValidationMode.Minimal
end

import arrow.internal.validate.parseValid

if nargin < 2
Expand All @@ -100,9 +101,11 @@
ValuesProxyID=valuesProxyID, ...
Valid=validElements ...
);

proxyName = "arrow.array.proxy.ListArray";
proxy = arrow.internal.proxy.create(proxyName, args);
% Validate the provided offsets and values.
proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
array = arrow.array.ListArray(proxy);
end

Expand Down
24 changes: 24 additions & 0 deletions matlab/src/matlab/+arrow/+array/ValidationMode.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
% Mode to use for Array validation.

% Licensed to the Apache Software Foundation (ASF) under one or more
% contributor license agreements. See the NOTICE file distributed with
% this work for additional information regarding copyright ownership.
% The ASF licenses this file to you under the Apache License, Version
% 2.0 (the "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing, software
% distributed under the License is distributed on an "AS IS" BASIS,
% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
% implied. See the License for the specific language governing
% permissions and limitations under the License.

classdef ValidationMode < uint8
enumeration
None (0)
Minimal (1)
Full (2)
end
end
105 changes: 105 additions & 0 deletions matlab/test/arrow/array/tListArray.m
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@

properties (TestParameter)
TestArrowArray
TestValidationModeArray
end

methods (TestParameterDefinition, Static)
Expand Down Expand Up @@ -113,6 +114,29 @@
);
end

function TestValidationModeArray = initializeTestValidationModeArray()
%% Valid ListArray
Offsets = arrow.array(int32([0, 1, 2, 3]));
Values = arrow.array([1, 2, 3]);

TestValidationModeArray.ValidList = struct( ...
Offsets=Offsets, ...
Values=Values, ...
Valid=true ...
);

%% Invalid ListArray
% Incorrect number of offsets (length should be 1 more than the number of Values).
Offsets = arrow.array(int32([0, 1, 2, 3, 4, 5]));
Values = arrow.array([1, 2, 3]);

TestValidationModeArray.InvalidList = struct( ...
Offsets=Offsets, ...
Values=Values, ...
Valid=false ...
);
end

end

methods (Test)
Expand Down Expand Up @@ -160,6 +184,87 @@ function TestErrorIfEmptyOffsets(testCase)
testCase.verifyError(fcn, "arrow:array:ListArrayFromArraysFailed");
end

function TestValidationModeDefault(testCase, TestValidationModeArray)
% Verify that the default ValidationMode value for the
% arrow.array.ListArray.fromArrays method is
% arrow.array.ValidationMode.Minimal.
offsets = TestValidationModeArray.Offsets;
values = TestValidationModeArray.Values;
valid = TestValidationModeArray.Valid;
fcn = @() arrow.array.ListArray.fromArrays(offsets, values);
if valid
testCase.verifyWarningFree(fcn);
else
testCase.verifyError(fcn, "arrow:array:ValidateMinimalFailed");
end
end

function TestValidationModeNone(testCase, TestValidationModeArray)
% Verify that no error is thrown when supplying the
% ValidatationMode name-value pair, with a value of
% arrow.array.ValidationMode.None, to the
% arrow.array.ListArray.fromArrays method.
offsets = TestValidationModeArray.Offsets;
values = TestValidationModeArray.Values;
validationMode = arrow.array.ValidationMode.None;
fcn = @() arrow.array.ListArray.fromArrays(offsets, values, ValidationMode=validationMode);
testCase.verifyWarningFree(fcn);
end

function TestValidationModeMinimal(testCase, TestValidationModeArray)
% Verify that an error of type arrow:array:ValidateMinimalFailed
% is thrown when supplying the ValidatationMode name-value pair,
% with a value of arrow.array.ValidationMode.Minimal, to the
% arrow.array.ListArray.fromArrays method, if the provided offsets
% and values arrays are invalid.
offsets = TestValidationModeArray.Offsets;
values = TestValidationModeArray.Values;
valid = TestValidationModeArray.Valid;
validationMode = arrow.array.ValidationMode.Minimal;
fcn = @() arrow.array.ListArray.fromArrays(offsets, values, ValidationMode=validationMode);
if valid
testCase.verifyWarningFree(fcn);
else
testCase.verifyError(fcn, "arrow:array:ValidateMinimalFailed");
end
end

function TestValidationModeFull(testCase, TestValidationModeArray)
% Verify that an error of type arrow:array:ValidateFullFailed
% is thrown when supplying the ValidatationMode name-value pair,
% with a value of arrow.array.ValidationMode.Full, to the
% arrow.array.ListArray.fromArrays method, if the provided offsets
% and values arrays are invalid.
offsets = TestValidationModeArray.Offsets;
values = TestValidationModeArray.Values;
validationMode = arrow.array.ValidationMode.Full;
valid = TestValidationModeArray.Valid;
fcn = @() arrow.array.ListArray.fromArrays(offsets, values, ValidationMode=validationMode);
if valid
testCase.verifyWarningFree(fcn);
else
testCase.verifyError(fcn, "arrow:array:ValidateFullFailed");
end
end

function TestValidationModeUnsupportedEnum(testCase)
% Verify that an error of type arrow:array:ValidateUnsupportedEnum
% is thrown when an unsupported integer enumeration value is
% supplied for the ValidatationMode parameter to the internal
% C++ ListArray Proxy validate method.
offsets = arrow.array.Int32Array.fromMATLAB(int32([0, 1, 2]));
values = arrow.array.Float64Array.fromMATLAB([1, 2, 3]);
array = arrow.array.ListArray.fromArrays(offsets, values);
% Get the underlying Proxy instance from the ListArray.
proxy = array.Proxy;
% Call the internal Proxy method "validate" with an unsupported
% integer ValidationMode value.
validationMode = uint8(3);
args = struct(ValidationMode=validationMode);
fcn = @() proxy.validate(args);
testCase.verifyError(fcn, "arrow:array:ValidateUnsupportedEnum");
end

end

end

0 comments on commit 18931f1

Please sign in to comment.