Skip writing down null buffers for non-nullable primitive arrays #6524
Conversation
parquet/Cargo.toml
Outdated
@@ -71,6 +71,9 @@ sysinfo = { version = "0.31.2", optional = true, default-features = false, featu
crc32fast = { version = "1.4.2", optional = true, default-features = false }

[dev-dependencies]
# Enable additional validations when running tests
This will disable/interfere with various validation tests within the codebase; perhaps we could instead just add validation within the fuzz test harness?
Ah yeah - I'd forgotten this would enable the feature for other crates when running tests in batch.
I didn't find any fuzz tests in parquet that would be easy to adapt. I ended up adding a new roundtrip test, and enhancing some of the default roundtrip validations to catch the issue.
I suspect it may make sense to add some negative tests and other coverage. I'm happy to come back to that if folks like the overall approach here.
I think this will solve this issue for `PrimitiveArray`, but other array types will also run into the same issue. The same approach can be applied in most cases, apart from potentially `DictionaryArray`, where nulls may not actually have a valid value. I'm not sure what the solution to that is.
Great comments - let me convert to draft while I think those over.
The new validations should help improve coverage from the existing tests, but none of them currently cover the null-masking bug.
I worked through all the "leaf" types methodically, and it seemed that all of them ended up wanting the exact same implementation as I had for primitive types - dropping the null buffer iff the column type was not optional - so I shifted the logic into the reader impl itself.
This appears to work out with the implementation above. I've gone ahead and pushed these changes, and I think this addresses all the issues I'm aware of... taking this back out of draft.
Sorry for the delay, I wanted some time to sit down and test the empty dictionary case.
The reason this potentially poses a problem is that a dictionary with an empty values array has no valid key. This matters because if you remove the null mask from a `DictionaryArray`, all of the `dict.keys().values()` must be `< dict.values().len()`, even those that would previously have been masked by `dict.nulls()`. This is normally fine, as the way null padding works is that every entry in `keys().values()` will either be `0` or some valid decoded index. The problem with an empty dictionary is that `0` is not actually a valid index, and therefore the code would be unsound 😅.
However, the reason this actually works is arguably a bug in `DictionaryDecoder::read`, where it short-circuits if the dictionary is empty without calling `DictionaryBuffer::as_keys`. As a result this triggers the non-dictionary fallback logic, where data is "decoded" as the underlying `ByteArray`; stripping the null mask from that is safe and yields an array of empty strings. This is then cast to a dictionary with a single empty string value.
So the TLDR is this works, but somewhat by accident, and is relying on what is arguably a bug in the `DictionaryDecoder`. I don't really have a good solution for this atm, so I suggest we just add the test for this case, and if `DictionaryDecoder` then changes it will hopefully catch this... It's a bit uncomfortable, but ultimately the definition of `DictionaryArray` is just a tad unfortunate in this regard.
Thanks for the thorough analysis on this! I had figured that since `ArrayData` validations were passing - which include the dictionary bounds checks - things were fine... and they are, but in a slightly more accidental way than I had appreciated. 😅 I suppose we'd have avoided this problem if we had taken the "maximal" approach and generated a bunch of intermediate null buffers instead of removing "unnecessary" ones. An argument in favour of that approach, I guess, if we ever end up needing to revisit...
Which issue does this PR close?
Closes #6510.
Rationale for this change
I looked into not generating the null buffer at all for non-nullable primitive arrays, but it looks like it's used by some other logic in a way that might be tricky to replace, and this is enough to fix the bug.
What changes are included in this PR?
Are there any user-facing changes?
There are many valid arrow representations for a particular parquet file, and this switches between two of them. It's not clear to me whether this is considered a change to the public interface or not... happy to take feedback there.