
Parquet reader can generate incorrect validity buffer information for nested structures #6510

Closed
bkirwi opened this issue Oct 5, 2024 · 4 comments · Fixed by #6524
Labels: bug · parquet (Changes to the parquet crate)

Comments

@bkirwi
Contributor

bkirwi commented Oct 5, 2024

Describe the bug
Parquet decoding can produce an arrow array that will fail validation. In particular:

  • Arrays like StructArray generally expect that any nulls in non-nullable child arrays are "masked" - e.g. if the child array is null at some index, the parent array must also be.
  • The parquet decoders only generate nullability info for nullable fields.

This can go wrong when struct arrays are nested. Suppose we have a nested schema like {outer: { inner: { primitive: utf8 } } }, where the primitive array is not nullable. The primitive array reader will generate a validity buffer in any case, but whether the outer structs do depends on whether they themselves have been declared nullable... so it's quite easy to end up with a struct that has no validity buffer but whose non-nullable child array does have nullability info, causing trouble.
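To make the "masked nulls" invariant concrete, here is a minimal sketch in plain Rust (no arrow dependency; `nulls_are_masked` is a hypothetical helper, not arrow-rs API) that models a validity buffer as a slice of bools, where true means valid:

```rust
// Hypothetical helper, not arrow-rs API: the invariant for a non-nullable
// child of a struct is that every null slot in the child is also null in
// the parent.
fn nulls_are_masked(parent: &[bool], child: &[bool]) -> bool {
    // Wherever the parent is valid (p == true), the child must be valid too.
    parent.iter().zip(child).all(|(&p, &c)| c || !p)
}

fn main() {
    // Schema: {outer: { inner: { primitive: utf8 (non-nullable) } } }.
    // The primitive reader emitted a validity buffer with a null at index 1...
    let primitive = [true, false, true];

    // ...but the struct readers above it, being non-nullable themselves,
    // emitted no buffer (equivalent to all-valid):
    let inner = [true, true, true];

    // Validation fails: the primitive's null at index 1 is unmasked.
    assert!(!nulls_are_masked(&inner, &primitive));

    // Either fix restores the invariant:
    // (a) generate the intermediate mask on the struct, or
    assert!(nulls_are_masked(&[true, false, true], &primitive));
    // (b) drop the unnecessary mask on the non-nullable primitive.
    assert!(nulls_are_masked(&inner, &[true, true, true]));
}
```

Both fixes (a) and (b) are discussed in the comments below; the sketch only shows why the as-decoded combination fails validation.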

To Reproduce
I've added a failing assert on a branch: master...bkirwi:arrow-rs:parquet-bug

(Hopefully it's fairly uncontroversial that that assert should pass? I've also seen this fail in other places - e.g. take on nested struct arrays will run validity checks and possibly error out - if this example isn't compelling.)

Expected behavior
I believe that the result of decoding should be a valid array according to Arrow's internal checks.

Additional context
While the failing test is a map_array test, I think this is a more general issue - maps just happen to have the mix of deep structure and mixed nullability that triggers the issue.

I am not 100% confident of my diagnosis yet, though I think the general shape is correct; I plan to follow up next week with more information or a fix.

@bkirwi bkirwi added the bug label Oct 5, 2024
@tustvold
Contributor

tustvold commented Oct 5, 2024

This sounds a lot like an issue I ran into on #4261.

My recollection is hazy, but I seem to remember this might be a limitation of the way masked nulls are computed during validation, in particular that it isn't recursive and only considers one level up.

Generating the intermediate null masks, whilst technically unnecessary, might be a way to get around this.

@bkirwi
Contributor Author

bkirwi commented Oct 5, 2024

Ah, yeah, that does look similar.

I'd been thinking that we'd need to generate the intermediate null masks, but #4252 reminds me that it would be equally valid (and rather more efficient) to instead skip the null mask for the primitive array. I've pushed a small commit to the branch I linked above that does the naive thing: dropping the generated null mask instead of attaching it to the generated array data. And it does pass tests!

@bkirwi
Contributor Author

bkirwi commented Oct 7, 2024

On reflection, skipping the null buffer for non-nullable fields feels best to me... both generating the intermediate null masks and dropping unnecessary null masks would fix the issue, so might as well bias to the one that involves less memory / compute.
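The chosen direction can be sketched in a few lines (hypothetical signature, not the actual arrow-rs reader code): only attach a decoder-generated validity buffer when the field is declared nullable, and drop it otherwise, since a non-nullable field must be all-valid anyway:

```rust
// Hypothetical sketch of the fix: discard the decoded validity buffer for
// fields declared non-nullable instead of attaching it to the array data.
fn attach_validity(field_nullable: bool, decoded: Option<Vec<bool>>) -> Option<Vec<bool>> {
    if field_nullable { decoded } else { None }
}

fn main() {
    // Non-nullable primitive: the decoded mask is discarded.
    assert_eq!(attach_validity(false, Some(vec![true, false, true])), None);
    // Nullable field: the mask is kept as-is.
    assert_eq!(
        attach_validity(true, Some(vec![true, false, true])),
        Some(vec![true, false, true])
    );
}
```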

I went ahead and opened the linked branch as a PR, though I'm happy to update it based on any other feedback on the issue here!

@alamb
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'parquet'} from #6524
