
Parquet reader can generate incorrect validity buffer information for nested structures #6510

Closed
bkirwi opened this issue Oct 5, 2024 · 4 comments · Fixed by #6524
Labels: bug · parquet (Changes to the parquet crate)

Comments

@bkirwi
Contributor

bkirwi commented Oct 5, 2024

Describe the bug
Parquet decoding can produce an arrow array that will fail validation. In particular:

  • Arrays like StructArray generally expect that any nulls in non-nullable child arrays are "masked" - e.g. if the child array is null at some index, the parent array must also be.
  • The parquet decoders only generate nullability info for nullable fields.

This can go wrong when struct arrays are nested. Suppose we have a nested schema like {outer: { inner: { primitive: utf8 } } }, where the primitive array is not nullable. The primitive array reader will generate a validity buffer in any case, but whether the outer structs do depends on whether they themselves have been declared nullable... so it's quite easy to end up with a struct that has no validity buffer but whose non-nullable child array does have nullability info, causing trouble.
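To make the "masked nulls" invariant concrete, here is a minimal sketch in plain Rust (no arrow dependency; `nulls_are_masked` is a hypothetical helper, not arrow-rs API) that models a validity buffer as a slice of bools, where true means valid:

```rust
// Hypothetical helper, not arrow-rs API: the invariant for a non-nullable
// child of a struct is that every null slot in the child is also null in
// the parent.
fn nulls_are_masked(parent: &[bool], child: &[bool]) -> bool {
    // Wherever the parent is valid (p == true), the child must be valid too.
    parent.iter().zip(child).all(|(&p, &c)| c || !p)
}

fn main() {
    // Schema: {outer: { inner: { primitive: utf8 (non-nullable) } } }.
    // The primitive reader emitted a validity buffer with a null at index 1...
    let primitive = [true, false, true];

    // ...but the struct readers above it, being non-nullable themselves,
    // emitted no buffer (equivalent to all-valid):
    let inner = [true, true, true];

    // Validation fails: the primitive's null at index 1 is unmasked.
    assert!(!nulls_are_masked(&inner, &primitive));

    // Either fix restores the invariant:
    // (a) generate the intermediate mask on the struct, or
    assert!(nulls_are_masked(&[true, false, true], &primitive));
    // (b) drop the unnecessary mask on the non-nullable primitive.
    assert!(nulls_are_masked(&inner, &[true, true, true]));
}
```

Both fixes (a) and (b) are discussed in the comments below; the sketch only shows why the as-decoded combination fails validation.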

To Reproduce
I've added a failing assert on a branch: master...bkirwi:arrow-rs:parquet-bug

(Hopefully it's fairly uncontroversial that that assert should pass? I've also seen this fail in other places - e.g. take on nested struct arrays will run validity checks and possibly error out - if this example isn't compelling.)

Expected behavior
I believe that the result of decoding should be a valid array according to Arrow's internal checks.

Additional context
While the failing test is a map_array test, I think this is a more general issue - maps just happen to have the mix of deep structure and mixed nullability that triggers the issue.

I am not 100% confident of my diagnosis yet, though I think the general shape is correct; I plan to follow up next week with more information or a fix.

@bkirwi bkirwi added the bug label Oct 5, 2024
@tustvold
Contributor

tustvold commented Oct 5, 2024

This sounds a lot like an issue I ran into on #4261.

My recollection is hazy, but I seem to remember this might be a limitation of the way masked nulls are computed during validation, in particular that it isn't recursive and only considers one level up.

Generating the intermediate null masks, whilst technically unnecessary, might be a way to get around this.

@bkirwi
Contributor Author

bkirwi commented Oct 5, 2024

Ah, yeah, that does look similar.

I'd been thinking that we'd need to generate the intermediate null masks, but #4252 reminds me that it would be equally valid (and rather more efficient) to instead skip the null mask for the primitive array. I've pushed a small commit to the branch I linked above that does the naive thing: dropping the generated null mask instead of attaching it to the generated array data. And it does pass tests!

@bkirwi
Contributor Author

bkirwi commented Oct 7, 2024

On reflection, skipping the null buffer for non-nullable fields feels best to me... both generating the intermediate null masks and dropping unnecessary null masks would fix the issue, so might as well bias to the one that involves less memory / compute.
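The chosen direction can be sketched in a few lines (hypothetical signature, not the actual arrow-rs reader code): only attach a decoder-generated validity buffer when the field is declared nullable, and drop it otherwise, since a non-nullable field must be all-valid anyway:

```rust
// Hypothetical sketch of the fix: discard the decoded validity buffer for
// fields declared non-nullable instead of attaching it to the array data.
fn attach_validity(field_nullable: bool, decoded: Option<Vec<bool>>) -> Option<Vec<bool>> {
    if field_nullable { decoded } else { None }
}

fn main() {
    // Non-nullable primitive: the decoded mask is discarded.
    assert_eq!(attach_validity(false, Some(vec![true, false, true])), None);
    // Nullable field: the mask is kept as-is.
    assert_eq!(
        attach_validity(true, Some(vec![true, false, true])),
        Some(vec![true, false, true])
    );
}
```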

I went ahead and opened the linked branch as a PR, though I'm happy to update it based on any other feedback on the issue here!

@alamb
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'parquet'} from #6524
