Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17964: [C++] Range data comparison for struct type may go out of bounds #14347

Closed
wants to merge 2 commits into from

Conversation

rtpsw
Copy link
Contributor

@rtpsw rtpsw commented Oct 7, 2022

@github-actions
Copy link

github-actions bot commented Oct 7, 2022

@lidavidm
Copy link
Member

lidavidm commented Oct 7, 2022

We should probably add a unit test for this

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 7, 2022

We should probably add a unit test for this

Agreed. I'm not sure how I'd like to test this just yet, so ideas are welcomed. At first, I thought I could somehow use AssertDatumsEqual, but it cannot be directly negated.

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 7, 2022

@lidavidm, are you aware of any example test case that checks for accessing an index out of bounds?

@lidavidm
Copy link
Member

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

Regardless, it seems something like this would suffice?

auto lhs = ArrayFromJSON(..., "[{"a": 2, "b": 3}]");
auto rhs = ArrayFromJSON(..., "[{"a": 2}]");
ASSERT_FALSE(lhs->Equals(*rhs));  // assuming this segfaults without this PR

@pitrou
Copy link
Member

pitrou commented Oct 10, 2022

Right, since the data types are supposed to match, this PR is only guarding against invalid data.
It you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 10, 2022

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

I run into the issue that led to this PR when a unit test compared data of two different types, one is the expected and another is the unexpected. My quick impression was that there seemed to be sufficiently many test cases that compare this way, without trying to compare types or to validate first. I figured it would be easier to fix the comparison in one place than to fix many test cases.

Right, since the data types are supposed to match, this PR is only guarding against invalid data.
It you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

I may be missing something, but I don't think validating would have fixed the issue, because I believe in the above case both expected and actual data were valid, since they both were obtained via JSON parsing; they were just of different types.

Regardless, it seems something like this would suffice? ... // assuming this segfaults without this PR

I didn't check your particular case yet, but I did run into a segfault in the case I described above. My vote would be to minimize segfaults as an indication of test failure, if only because such a failure would be less convenient to work with.

@lidavidm
Copy link
Member

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

@pitrou
Copy link
Member

pitrou commented Oct 10, 2022

It does start with a type comparison, it's also mentioned above:

class RangeDataEqualsImpl {
public:
// PRE-CONDITIONS:
// - the types are equal
// - the ranges are in bounds
RangeDataEqualsImpl(const EqualOptions& options, bool floating_approximate,

and you can see an example of type checking here:
bool CompareArrayRanges(const ArrayData& left, const ArrayData& right,
int64_t left_start_idx, int64_t left_end_idx,
int64_t right_start_idx, const EqualOptions& options,
bool floating_approximate) {
if (left.type->id() != right.type->id() ||
!TypeEquals(*left.type, *right.type, false /* check_metadata */)) {
return false;
}

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 10, 2022

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

AFAICS, the path of invocations seems to be Array::Equals -> ArrayEquals -> ArrayRangeEquals -> CompareArrayRanges -> TypeEqualsVisitor::Visit(const StructType &). This path does not go thorough TypeEqualsVisitor::VisitChildren, for example, that would have checked for equal types. Perhaps all visitors methods in TypeEqualsVisitor should start with a type comparison.

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 10, 2022

It does start with a type comparison

Right. As I just noted in crossing, I suspect the issue is with TypeEqualsVisitor which it uses.

@pitrou
Copy link
Member

pitrou commented Oct 10, 2022

Can you show a snippet that would show the issue?

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 10, 2022

Can you show a snippet that would show the issue?

Good chances I could. I'll need a bit of time to get to this, though.

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 13, 2022

Can you show a snippet that would show the issue?

Good chances I could. I'll need a bit of time to get to this, though.

I added a test that checks for this by comparing a badly structures array with a correctly structured one:

  1. The final check in the test covers what you asked for. The other checks appearing before it are for normal conditions.
  2. Without the change in RangeDataEqualsImpl, the final check fails due to not observing a failure.
  3. Without the change in PrintDiff (but with the change in RangeDataEqualsImpl), the final check leads to a segmentation fault.

@pitrou
Copy link
Member

pitrou commented Oct 13, 2022

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 13, 2022

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

This circles back to points we discussed. I can understand the requirement of passing valid data in a correct Arrow app, as well as in correct Arrow code, but less so during its development, where incorrect code frequently occurs. This PR aims to make (failure analysis during) development easier, given that its runtime cost is small. For the purpose of cost, I think the calls to ValidateFull in PrintDiff shouldn't count because they can be removed - I only give them for reproducibility without a segmentation fault.

@pitrou
Copy link
Member

pitrou commented Oct 13, 2022

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

@rtpsw
Copy link
Contributor Author

rtpsw commented Oct 13, 2022

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

Well, not really randomly, but I understand what you're saying. While I agree one can easily add validation calls to the code while developing, I still think it's not convenient because in many cases the segfault is not making it easy to determine which structure is invalid. Moreover, a segfault is a relatively good result; when luck betrays, the result might be a memory leak or a buffer overrun that would make analysis of the root cause much harder.

Having said that, since you seem to be firm in your opinion, I'll stop pushing for this PR. It's not high priority.

@rtpsw rtpsw closed this Oct 13, 2022
@rtpsw rtpsw deleted the ARROW-17964 branch October 13, 2022 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants