
Skip writing down null buffers for non-nullable primitive arrays #6524

Merged
4 commits merged into apache:master on Oct 21, 2024

Conversation


@bkirwi bkirwi commented Oct 7, 2024

Which issue does this PR close?

Closes #6510.

Rationale for this change

I looked into not generating the null buffer at all for non-nullable primitive arrays, but it looks like it's used by some other logic in a way that might be tricky to replace, and this is enough to fix the bug.

What changes are included in this PR?

  • Enable array-data validations in parquet unit tests. (Seems generally useful, but also flags this bug.)
  • Skips including a null buffer in the array data for non-nullable primitive arrays.
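
The core change can be sketched as follows. This is a hypothetical, self-contained illustration (the real code works with arrow-rs `ArrayData` builders, not these toy types): a validity buffer is attached only when the parquet column is optional; for required columns the buffer would be all-set, so omitting it is semantically equivalent and keeps non-nullable validation happy.

```rust
// Hypothetical sketch, not the actual arrow-rs implementation.
#[derive(Debug, PartialEq)]
struct ArrayDataSketch {
    values: Vec<i32>,
    nulls: Option<Vec<bool>>, // validity bitmap, true = valid
}

fn build_array_data(values: Vec<i32>, nulls: Vec<bool>, nullable: bool) -> ArrayDataSketch {
    ArrayDataSketch {
        values,
        // Skip the null buffer entirely for non-nullable (required) columns.
        nulls: if nullable { Some(nulls) } else { None },
    }
}

fn main() {
    let required = build_array_data(vec![1, 2, 3], vec![true, true, true], false);
    assert!(required.nulls.is_none());

    let optional = build_array_data(vec![1, 0, 3], vec![true, false, true], true);
    assert_eq!(optional.nulls, Some(vec![true, false, true]));
    println!("ok");
}
```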

Are there any user-facing changes?

There are many valid arrow representations for a particular parquet file, and this switches between two of them. It's not clear to me whether this is considered a change to the public interface or not... happy to take feedback there.

@github-actions bot added the parquet (Changes to the parquet crate) label on Oct 7, 2024
@bkirwi bkirwi changed the title Parquet bug Skip writing down null buffers for non-nullable primitive arrays Oct 7, 2024
@@ -71,6 +71,9 @@ sysinfo = { version = "0.31.2", optional = true, default-features = false, featu
crc32fast = { version = "1.4.2", optional = true, default-features = false }

[dev-dependencies]
# Enable additional validations when running tests
A contributor commented on this diff:
This will disable/interfere with various validation tests within the codebase, perhaps we could instead just add validation within the fuzz test harness?

bkirwi (author) replied:

Ah yeah - I'd forgotten this would enable the feature for other crates when running tests in batch.

I didn't find any fuzz tests in parquet that would be easy to adapt. I ended up adding a new roundtrip test, and enhancing some of the default roundtrip validations to catch the issue.

I suspect it may make sense to add some negative tests and other coverage. I'm happy to come back to that if folks like the overall approach here.


tustvold commented Oct 8, 2024

I think this will solve this issue for PrimitiveArray, but other array types will also run into the same issue. The same approach can be applied in most cases, apart from potentially DictionaryArray, where nulls may not actually have a valid value. I'm not sure what the solution to that is.


bkirwi commented Oct 8, 2024

> I think this will solve this issue for PrimitiveArray, but other array types will also run into the same issue. The same approach can be applied in most cases, apart from potentially DictionaryArray - where nulls may not actually have a valid value. I'm not sure what the solution to that is

Great comments - let me convert to draft while I think those over.

@bkirwi bkirwi marked this pull request as draft October 8, 2024 16:03
(Commit message) The new validations should help improve coverage from the existing tests, but none of them currently cover the null-masking bug.

bkirwi commented Oct 11, 2024

> I think this will solve this issue for PrimitiveArray, but other array types will also run into the same issue.

I worked through all the "leaf" types methodically, and it seemed that all of them ended up wanting the exact same implementation as I had for primitive types - dropping the null buffer iff the column type was not optional - so I shifted the logic into the reader impl itself.

> The same approach can be applied in most cases, apart from potentially DictionaryArray - where nulls may not actually have a valid value.

This appears to work out with the implementation above: the validate_non_nullable validation only looks at the dictionary array's own null buffer (and not, for example, logical_nulls or anything else in the key array), which is precisely the buffer we're skipping. I've added a case to the test that should check this specifically.

I've gone ahead and pushed these changes, and I think this addresses all the issues I'm aware of... taking this back out of draft.
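
A rough sketch of the kind of check described above (hypothetical types; this is not the actual arrow-rs validate_non_nullable): for a field declared non-nullable, only the array's own physical null buffer is consulted, which is exactly the buffer the reader now skips writing.

```rust
// Hypothetical sketch of a non-nullable validation, not the arrow-rs code.
struct ColumnSketch {
    nullable: bool,
    null_buffer: Option<Vec<bool>>, // physical validity bitmap, true = valid
}

fn validate_non_nullable(col: &ColumnSketch) -> Result<(), String> {
    if !col.nullable {
        // Only the column's own physical null buffer is inspected;
        // logical nulls from child arrays are deliberately ignored.
        if let Some(nulls) = &col.null_buffer {
            if nulls.iter().any(|&valid| !valid) {
                return Err("non-nullable column contains nulls".to_string());
            }
        }
    }
    Ok(())
}

fn main() {
    // A required column with no null buffer at all: trivially valid.
    assert!(validate_non_nullable(&ColumnSketch { nullable: false, null_buffer: None }).is_ok());
    // A required column whose buffer marks a slot invalid: rejected.
    assert!(validate_non_nullable(&ColumnSketch {
        nullable: false,
        null_buffer: Some(vec![true, false]),
    })
    .is_err());
    println!("ok");
}
```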

@bkirwi bkirwi marked this pull request as ready for review October 11, 2024 19:17
@tustvold tustvold left a comment:

Sorry for the delay, I wanted some time to sit down and test the empty dictionary case.

The reason this potentially poses a problem is that a dictionary with an empty values array has no valid key. This matters because if you remove the null mask from a DictionaryArray, all of the dict.keys().values() must be < dict.values().len(), even those that would have previously been masked by the dict.nulls().

This is normally fine, as the way null padding works is that all of keys().values() will either be 0 or some valid decoded index.

The problem with an empty dictionary is 0 is not actually valid, and therefore the code would be unsound 😅.

However, the reason this actually works is (arguably) a bug in DictionaryDecoder::read, where it short-circuits if the dictionary is empty without calling DictionaryBuffer::as_keys. As a result this triggers the non-dictionary fallback logic, where data is "decoded" as the underlying ByteArray; stripping away the null mask from this is safe and yields an array of empty strings. This is then cast to a dictionary with a single empty string value.

So the TL;DR is this works, but somewhat by accident, relying on what is arguably a bug in the DictionaryDecoder. I don't really have a good solution for this at the moment, so I suggest we just add the test for this case; if DictionaryDecoder then changes, it will hopefully catch this... It's a bit uncomfortable, but ultimately the definition of DictionaryArray is just a tad unfortunate in this regard.
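
The invariant at issue can be illustrated with a small self-contained sketch (hypothetical helper, not arrow-rs code): once the null mask is removed, every key must index a valid dictionary value, including slots that were previously masked; with an empty values array no key can be valid.

```rust
// Hypothetical illustration of the dictionary-key soundness invariant.
// keys: dictionary key for each slot; values_len: length of the values array;
// nulls: optional validity bitmap (true = valid slot).
fn keys_are_sound(keys: &[usize], values_len: usize, nulls: Option<&[bool]>) -> bool {
    keys.iter().enumerate().all(|(i, &k)| {
        let is_null = nulls.map_or(false, |n| !n[i]);
        // A null slot need not hold a valid key only while the mask exists.
        is_null || k < values_len
    })
}

fn main() {
    // Null padding normally writes key 0, which is valid for a non-empty dictionary.
    assert!(keys_are_sound(&[0, 1, 0], 2, None));
    // With an empty dictionary, key 0 is out of bounds once the mask is gone...
    assert!(!keys_are_sound(&[0, 0], 0, None));
    // ...but fine while the null mask still covers those slots.
    assert!(keys_are_sound(&[0, 0], 0, Some(&[false, false])));
    println!("ok");
}
```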

parquet/src/arrow/arrow_writer/mod.rs (review thread resolved)
@tustvold tustvold merged commit 936e40e into apache:master Oct 21, 2024
16 checks passed

bkirwi commented Oct 25, 2024

Thanks for the thorough analysis on this! I had figured that since ArrayData validations were passing - which include the dictionary bounds checks - things were fine... and they are, but in a slightly more accidental way than I had appreciated. 😅

I suppose we'd have avoided this problem if we took the "maximal" approach and generated a bunch of intermediate null buffers instead of removing "unnecessary" ones. That's an argument in favour of that approach, I guess, if we ever end up needing to revisit this...

Successfully merging this pull request may close these issues:

Parquet reader can generate incorrect validity buffer information for nested structures