Parquet reader can generate incorrect validity buffer information for nested structures #6510
Comments
This sounds a lot like an issue I ran into on #4261. My recollection is hazy, but I seem to remember this might be a limitation of the way masked nulls are computed during validation, in particular that it isn't recursive and only considers one level up. Generating the intermediate null masks, whilst technically unnecessary, might be a way to get around this.
Ah, yeah, that does look similar. I'd been thinking that we'd need to generate the intermediate null masks, but #4252 reminds me that it would be equally valid (and rather more efficient) to instead skip the null mask for the primitive array. I've pushed a small commit to the branch I linked above that does the naive thing: dropping the generated null mask instead of attaching it to the generated array data. And it does pass tests!
On reflection, skipping the null buffer for non-nullable fields feels best to me... both generating the intermediate null masks and dropping unnecessary null masks would fix the issue, so might as well bias to the one that involves less memory / compute. I went ahead and opened the linked branch as a PR, though I'm happy to update it based on any other feedback on the issue here!
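For reference, a minimal sketch of the shape that fix could take (a hypothetical helper, not the actual reader code in the PR): only attach the computed validity buffer when the field is declared nullable, since any nulls in a non-nullable field must already be masked by an ancestor's validity buffer.

```rust
use arrow_buffer::NullBuffer;
use arrow_data::ArrayDataBuilder;
use arrow_schema::Field;

/// Hypothetical helper: attach the computed validity buffer only when the
/// field is declared nullable; otherwise drop it, since any nulls it records
/// must already be masked by an ancestor's validity buffer.
fn attach_nulls(builder: ArrayDataBuilder, field: &Field, nulls: NullBuffer) -> ArrayDataBuilder {
    if field.is_nullable() {
        builder.nulls(Some(nulls))
    } else {
        // The buffer carries no extra information here, so skip it entirely,
        // saving the memory and avoiding the validation mismatch.
        builder
    }
}
```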
Describe the bug
Parquet decoding can produce an arrow array that will fail validation. In particular: `StructArray`s generally expect that any nulls in non-nullable child arrays are "masked" - e.g. if the child array is null at some index, the parent array must also be.

This can go wrong when struct arrays are nested. Suppose we have a nested schema like `{outer: { inner: { primitive: utf8 } } }`, where the `primitive` array is not nullable. The primitive array reader will generate a validity buffer in any case, but whether the outer structs do or not depends on whether they've themselves been declared nullable... so it's quite easy to end up with a struct with no validity buffer that has a non-nullable field of an array that does have nullability info, causing trouble.
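To illustrate the masking rule itself, here's a minimal sketch using arrow-rs's safe `StructArray::try_new` constructor (an illustration of the invariant, not of the reader code): a null in a non-nullable child is only legal if the parent struct is also null at that index.

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, StringArray, StructArray};
use arrow_buffer::NullBuffer;
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // A non-nullable utf8 child with a null at index 1.
    let child: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None]));
    let fields = Fields::from(vec![Field::new("primitive", DataType::Utf8, false)]);

    // Rejected: no struct-level null masks the child's null.
    assert!(StructArray::try_new(fields.clone(), vec![child.clone()], None).is_err());

    // Accepted: the struct is itself null wherever the child is.
    let mask = NullBuffer::from(vec![true, false]);
    assert!(StructArray::try_new(fields, vec![child], Some(mask)).is_ok());
}
```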
To Reproduce
I've added a failing assert on a branch: master...bkirwi:arrow-rs:parquet-bug
(Hopefully it's fairly uncontroversial that that assert should pass? I've also seen this fail in other places - e.g. `take` on nested struct arrays will run validity checks and possibly error out - if this example isn't compelling.)
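For a sense of the shape of the problem, here's a minimal sketch (not the branch's actual test, and assuming my reading of the validation code is right) that hand-builds the layout described above with validation skipped, then asks Arrow to validate it. The null in the leaf is masked by `outer`'s validity buffer, but the check only consults the immediate parent (`inner`), so validation fails:

```rust
use arrow_buffer::{Buffer, NullBuffer};
use arrow_data::ArrayData;
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Non-nullable utf8 leaf that still carries a validity buffer with a null,
    // mimicking what the primitive array reader emits.
    let primitive = unsafe {
        ArrayData::builder(DataType::Utf8)
            .len(2)
            .add_buffer(Buffer::from_slice_ref([0i32, 1, 1])) // offsets
            .add_buffer(Buffer::from_slice_ref(b"a")) // values
            .nulls(Some(NullBuffer::from(vec![true, false])))
            .build_unchecked()
    };

    // `inner` struct: generated without a validity buffer of its own.
    let inner_type = DataType::Struct(Fields::from(vec![Field::new(
        "primitive",
        DataType::Utf8,
        false,
    )]));
    let inner = unsafe {
        ArrayData::builder(inner_type.clone())
            .len(2)
            .add_child_data(primitive)
            .build_unchecked()
    };

    // `outer` struct: its validity buffer is what actually masks the null.
    let outer_type = DataType::Struct(Fields::from(vec![Field::new("inner", inner_type, true)]));
    let outer = unsafe {
        ArrayData::builder(outer_type)
            .len(2)
            .nulls(Some(NullBuffer::from(vec![true, false])))
            .add_child_data(inner)
            .build_unchecked()
    };

    // The null at index 1 is masked by `outer`, but the check only consults
    // the immediate parent, so validation reports unmasked nulls in the leaf.
    assert!(outer.validate_full().is_err());
}
```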
Expected behavior
I believe that the result of decoding should be a valid array according to Arrow's internal checks.
Additional context
While the failing test is a `map_array` test, I think this is a more general issue - maps just happen to have the mix of deep structure and mixed nullability that triggers the issue.

I am not 100% confident of my diagnosis yet, though I think the general shape is correct; I plan to follow up next week with more information or a fix.