ARROW-17964: [C++] Range data comparison for struct type may go out of bounds #14347
We should probably add a unit test for this.
Agreed. I'm not sure how I'd like to test this just yet, so ideas are welcome. At first, I thought I could somehow use |
@lidavidm, are you aware of any example test case that checks for accessing an index out of bounds?
No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type? Regardless, it seems something like this would suffice?
auto lhs = ArrayFromJSON(..., R"([{"a": 2, "b": 3}])");
auto rhs = ArrayFromJSON(..., R"([{"a": 2}])");
ASSERT_FALSE(lhs->Equals(*rhs));  // assuming this segfaults without this PR
Right, since the data types are supposed to match, this PR is only guarding against invalid data.
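To make the kind of guard under discussion concrete, here is a minimal sketch on a toy model. It is not Arrow's code: `ToyStructData` and `ToyRangeEquals` are hypothetical names, and the sketch assumes a struct array can be modelled as one vector of `int` values per child field. It only shows how a range comparison can return `false` rather than index past the end when the two sides have different shapes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a struct array: one column of values per child field.
// Hypothetical model for illustration only; this is not Arrow's ArrayData.
struct ToyStructData {
  std::vector<std::vector<int>> children;
};

// Compares rows [start, end) of two toy struct arrays.
// Returns false instead of reading out of bounds when the shapes disagree.
bool ToyRangeEquals(const ToyStructData& left, const ToyStructData& right,
                    std::size_t start, std::size_t end) {
  assert(start <= end);
  // Guard 1: with differing child counts the two sides cannot be equal, and
  // indexing right.children[i] blindly could run past its end.
  if (left.children.size() != right.children.size()) return false;
  for (std::size_t i = 0; i < left.children.size(); ++i) {
    const std::vector<int>& l = left.children[i];
    const std::vector<int>& r = right.children[i];
    // Guard 2: each child must actually cover the requested row range.
    if (end > l.size() || end > r.size()) return false;
    for (std::size_t row = start; row < end; ++row) {
      if (l[row] != r[row]) return false;
    }
  }
  return true;
}
```

With a two-child left side and a one-child right side, mirroring the `{"a": 2, "b": 3}` versus `{"a": 2}` case above, `ToyRangeEquals` returns `false` without touching a missing child.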
I ran into the issue that led to this PR when a unit test compared data of two different types, one the expected value and the other the actual value. My quick impression was that quite a few test cases compare this way, without comparing types or validating first. I figured it would be easier to fix the comparison in one place than to fix many test cases.
I may be missing something, but I don't think validating would have fixed the issue, because I believe in the above case both expected and actual data were valid, since they both were obtained via JSON parsing; they were just of different types.
I didn't check your particular case yet, but I did run into a segfault in the case I described above. My vote would be to minimize segfaults as an indication of test failure, if only because such a failure is less convenient to work with.
I think it's OK to have this. I wonder why the comparison doesn't start with a type comparison, though (which should avoid this class of issues).
It does start with a type comparison; it's also mentioned above: arrow/cpp/src/arrow/compare.cc Lines 164 to 169 in d4190cc
and you can see an example of type checking here: arrow/cpp/src/arrow/compare.cc Lines 547 to 554 in d4190cc
AFAICS, the path of invocations seems to be |
Right. As I just noted in crossing, I suspect the issue is with |
Can you show a snippet that would show the issue?
There's a good chance I could. I'll need a bit of time to get to this, though.
I added a test that checks for this by comparing a badly structured array with a correctly structured one:
|
Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to (note: "invalid data" here is a badly structured array) |
This circles back to points we discussed. I can understand the requirement of passing valid data in a correct Arrow app, as well as in correct Arrow code, but less so during its development, where incorrect code frequently occurs. This PR aims to make (failure analysis during) development easier, given that its runtime cost is small. For the purpose of cost, I think the calls to |
But again, during development you can call |
Well, not really randomly, but I understand what you're saying. While I agree one can easily add validation calls to the code while developing, I still think it's not convenient, because in many cases the segfault does not make it easy to determine which structure is invalid. Moreover, a segfault is a relatively good result; with worse luck, the result might be a memory leak or a buffer overrun that would make analysis of the root cause much harder. Having said that, since you seem to be firm in your opinion, I'll stop pushing for this PR. It's not high priority.
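As an illustration of the trade-off discussed here (validate explicitly during development versus guarding inside the comparison), the following is a rough sketch on a toy model, not Arrow's API. `ToyStructArray`, `ToyValidate` and `ToyCheckedEquals` are hypothetical names; the sketch assumes a struct array records how many children its type promises, and shows a helper that validates both operands first and reports which one is badly structured.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model of a struct array; not Arrow's ArrayData or API.
struct ToyStructArray {
  std::size_t num_fields = 0;  // number of children the declared type promises
  std::size_t length = 0;      // number of rows
  std::vector<std::vector<int>> children;
};

// Returns an empty string if the array is well formed, else a description.
std::string ToyValidate(const ToyStructArray& a) {
  if (a.children.size() != a.num_fields) {
    return "expected " + std::to_string(a.num_fields) + " children, got " +
           std::to_string(a.children.size());
  }
  for (std::size_t i = 0; i < a.children.size(); ++i) {
    if (a.children[i].size() < a.length) {
      return "child " + std::to_string(i) + " is shorter than the array length";
    }
  }
  return "";
}

// Validates both operands before comparing. On invalid input, returns false
// and names the offending operand in *error; otherwise *error is cleared.
bool ToyCheckedEquals(const ToyStructArray& left, const ToyStructArray& right,
                      std::string* error) {
  assert(error != nullptr);
  error->clear();
  std::string problem = ToyValidate(left);
  if (!problem.empty()) {
    *error = "left: " + problem;
    return false;
  }
  problem = ToyValidate(right);
  if (!problem.empty()) {
    *error = "right: " + problem;
    return false;
  }
  // Both sides are well formed, so the indexing below stays in bounds.
  if (left.num_fields != right.num_fields || left.length != right.length) {
    return false;
  }
  for (std::size_t i = 0; i < left.children.size(); ++i) {
    for (std::size_t row = 0; row < left.length; ++row) {
      if (left.children[i][row] != right.children[i][row]) return false;
    }
  }
  return true;
}
```

In this toy setting, a failing test would surface a message such as "right: expected 2 children, got 1" instead of a crash, at the cost of one validation pass per operand.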
See https://issues.apache.org/jira/browse/ARROW-17964