
Skip writing down null buffers for non-nullable primitive arrays #6524

Merged
4 commits merged into apache:master on Oct 21, 2024

Conversation


@bkirwi bkirwi commented Oct 7, 2024

Which issue does this PR close?

Closes #6510.

Rationale for this change

I looked into not generating the null buffer at all for non-nullable primitive arrays, but it looks like it's used by some other logic in a way that might be tricky to replace, and this is enough to fix the bug.

What changes are included in this PR?

  • Enable array-data validations in parquet unit tests. (Seems generally useful, but also flags this bug.)
  • Skips including a null buffer in the array data for non-nullable primitive arrays.
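
The core change can be sketched as follows. This is a hypothetical, self-contained illustration (the real code works with arrow-rs `ArrayData` builders, not these toy types): a validity buffer is attached only when the parquet column is optional; for required columns the buffer would be all-set, so omitting it is semantically equivalent and keeps non-nullable validation happy.

```rust
// Hypothetical sketch, not the actual arrow-rs implementation.
#[derive(Debug, PartialEq)]
struct ArrayDataSketch {
    values: Vec<i32>,
    nulls: Option<Vec<bool>>, // validity bitmap, true = valid
}

fn build_array_data(values: Vec<i32>, nulls: Vec<bool>, nullable: bool) -> ArrayDataSketch {
    ArrayDataSketch {
        values,
        // Skip the null buffer entirely for non-nullable (required) columns.
        nulls: if nullable { Some(nulls) } else { None },
    }
}

fn main() {
    let required = build_array_data(vec![1, 2, 3], vec![true, true, true], false);
    assert!(required.nulls.is_none());

    let optional = build_array_data(vec![1, 0, 3], vec![true, false, true], true);
    assert_eq!(optional.nulls, Some(vec![true, false, true]));
    println!("ok");
}
```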

Are there any user-facing changes?

There are many valid arrow representations for a particular parquet file, and this switches between two of them. It's not clear to me whether this is considered a change to the public interface or not... happy to take feedback there.

@github-actions bot added the parquet (Changes to the parquet crate) label on Oct 7, 2024
@bkirwi bkirwi changed the title Parquet bug Skip writing down null buffers for non-nullable primitive arrays Oct 7, 2024
@@ -71,6 +71,9 @@ sysinfo = { version = "0.31.2", optional = true, default-features = false, featu
crc32fast = { version = "1.4.2", optional = true, default-features = false }

[dev-dependencies]
# Enable additional validations when running tests
A contributor commented on this diff:
This will disable/interfere with various validation tests within the codebase, perhaps we could instead just add validation within the fuzz test harness?

bkirwi (author) replied:

Ah yeah - I'd forgotten this would enable the feature for other crates when running tests in batch.

I didn't find any fuzz tests in parquet that would be easy to adapt. I ended up adding a new roundtrip test, and enhancing some of the default roundtrip validations to catch the issue.

I suspect it may make sense to add some negative tests and other coverage. I'm happy to come back to that if folks like the overall approach here.


tustvold commented Oct 8, 2024

I think this will solve this issue for PrimitiveArray, but other array types will also run into the same issue. The same approach can be applied in most cases, apart from potentially DictionaryArray, where nulls may not actually have a valid value. I'm not sure what the solution to that is.


bkirwi commented Oct 8, 2024

> I think this will solve this issue for PrimitiveArray, but other array types will also run into the same issue. The same approach can be applied in most cases, apart from potentially DictionaryArray - where nulls may not actually have a valid value. I'm not sure what the solution to that is

Great comments - let me convert to draft while I think those over.

@bkirwi bkirwi marked this pull request as draft October 8, 2024 16:03
(Commit message) The new validations should help improve coverage from the existing tests, but none of them currently cover the null-masking bug.

bkirwi commented Oct 11, 2024

> I think this will solve this issue for PrimitiveArray, but other array types will also run into the same issue.

I worked through all the "leaf" types methodically, and it seemed that all of them ended up wanting the exact same implementation as I had for primitive types - dropping the null buffer iff the column type was not optional - so I shifted the logic into the reader impl itself.

> The same approach can be applied in most cases, apart from potentially DictionaryArray - where nulls may not actually have a valid value.

This appears to work out with the implementation above: the validate_non_nullable validation only looks at the dictionary array's own null buffer (and not, for example, logical_nulls or anything else in the key array), which is precisely the buffer we're skipping. I've added a case to the test that should check this specifically.

I've gone ahead and pushed these changes, and I think this addresses all the issues I'm aware of... taking this back out of draft.
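
A rough sketch of the kind of check described above (hypothetical types; this is not the actual arrow-rs validate_non_nullable): for a field declared non-nullable, only the array's own physical null buffer is consulted, which is exactly the buffer the reader now skips writing.

```rust
// Hypothetical sketch of a non-nullable validation, not the arrow-rs code.
struct ColumnSketch {
    nullable: bool,
    null_buffer: Option<Vec<bool>>, // physical validity bitmap, true = valid
}

fn validate_non_nullable(col: &ColumnSketch) -> Result<(), String> {
    if !col.nullable {
        // Only the column's own physical null buffer is inspected;
        // logical nulls from child arrays are deliberately ignored.
        if let Some(nulls) = &col.null_buffer {
            if nulls.iter().any(|&valid| !valid) {
                return Err("non-nullable column contains nulls".to_string());
            }
        }
    }
    Ok(())
}

fn main() {
    // A required column with no null buffer at all: trivially valid.
    assert!(validate_non_nullable(&ColumnSketch { nullable: false, null_buffer: None }).is_ok());
    // A required column whose buffer marks a slot invalid: rejected.
    assert!(validate_non_nullable(&ColumnSketch {
        nullable: false,
        null_buffer: Some(vec![true, false]),
    })
    .is_err());
    println!("ok");
}
```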

@bkirwi bkirwi marked this pull request as ready for review October 11, 2024 19:17
@tustvold tustvold left a comment:

Sorry for the delay, I wanted some time to sit down and test the empty dictionary case.

The reason this potentially poses a problem is that a dictionary with an empty values array has no valid key. This matters because if you remove the null mask from a DictionaryArray, all of the dict.keys().values() must be < dict.values().len(), even those that would have previously been masked by the dict.nulls().

This is normally fine, as the way null padding works is that all of keys().values() will either be 0 or some valid decoded index.

The problem with an empty dictionary is 0 is not actually valid, and therefore the code would be unsound 😅.

However, the reason this actually works is (arguably) a bug in DictionaryDecoder::read, where it short-circuits if the dictionary is empty without calling DictionaryBuffer::as_keys. As a result this triggers the non-dictionary fallback logic, where data is "decoded" as the underlying ByteArray; stripping away the null mask from this is safe and yields an array of empty strings. This is then cast to a dictionary with a single empty string value.

So the TL;DR is this works, but somewhat by accident, relying on what is arguably a bug in the DictionaryDecoder. I don't really have a good solution for this at the moment, so I suggest we just add the test for this case; if DictionaryDecoder then changes, it will hopefully catch this... It's a bit uncomfortable, but ultimately the definition of DictionaryArray is just a tad unfortunate in this regard.
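
The invariant at issue can be illustrated with a small self-contained sketch (hypothetical helper, not arrow-rs code): once the null mask is removed, every key must index a valid dictionary value, including slots that were previously masked; with an empty values array no key can be valid.

```rust
// Hypothetical illustration of the dictionary-key soundness invariant.
// keys: dictionary key for each slot; values_len: length of the values array;
// nulls: optional validity bitmap (true = valid slot).
fn keys_are_sound(keys: &[usize], values_len: usize, nulls: Option<&[bool]>) -> bool {
    keys.iter().enumerate().all(|(i, &k)| {
        let is_null = nulls.map_or(false, |n| !n[i]);
        // A null slot need not hold a valid key only while the mask exists.
        is_null || k < values_len
    })
}

fn main() {
    // Null padding normally writes key 0, which is valid for a non-empty dictionary.
    assert!(keys_are_sound(&[0, 1, 0], 2, None));
    // With an empty dictionary, key 0 is out of bounds once the mask is gone...
    assert!(!keys_are_sound(&[0, 0], 0, None));
    // ...but fine while the null mask still covers those slots.
    assert!(keys_are_sound(&[0, 0], 0, Some(&[false, false])));
    println!("ok");
}
```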

parquet/src/arrow/arrow_writer/mod.rs (review thread resolved)
@tustvold tustvold merged commit 936e40e into apache:master Oct 21, 2024
16 checks passed

bkirwi commented Oct 25, 2024

Thanks for the thorough analysis on this! I had figured that since ArrayData validations were passing - which include the dictionary bounds checks - things were fine... and they are, but in a slightly more accidental way than I had appreciated. 😅

I suppose we'd have avoided this problem if we took the "maximal" approach and generated a bunch of intermediate null buffers instead of removing "unnecessary" ones. That's an argument in favour of that approach, I guess, if we ever end up needing to revisit this...

Successfully merging this pull request may close these issues:

Parquet reader can generate incorrect validity buffer information for nested structures