-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reading of dictionary encoded pages with null values (#1111) #1130
Fix reading of dictionary encoded pages with null values (#1111) #1130
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1130 +/- ##
==========================================
+ Coverage 82.33% 82.38% +0.04%
==========================================
Files 169 169
Lines 49773 50247 +474
==========================================
+ Hits 40981 41395 +414
- Misses 8792 8852 +60
Continue to review full report at Codecov.
|
Thank you @yordan-pavlov -- I plan to review / test this patch later today |
FWIW I tested with datafusion on the case where we initially observed this issue and this patch fixes the issue: apache/datafusion#1441 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @yordan-pavlov -- as I mentioned, I verified that this change fixes the issues we had been seeing in DataFusion;
The basic idea of this PR looks great and makes sense to me, but I am not comfortable enough with the implementation of the parquet reader to fully understand it.
I wonder if @tustvold, @sunchao or @nevi-me could have a look?
Also, I wonder if there are ay benchmark results to demonstrate the performance change related to this PR?
num_values: usize, | ||
) -> Result<usize> { | ||
let mut def_level_decoder = LevelValueDecoder::new(level_decoder); | ||
let def_level_array = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean an entirely new array of definition levels is created for each column? Might this result in a non trivial amount of extra runtime overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - def levels are decoded a second time for this fix and an i16 array and a boolean array are created to count the non-null values, but they only live for a very short time and the negative effect on performance is surprisingly small (3% to 8% in my benchmark run) see here: #1111 (comment) ; even after this change the ArrowArrayReader
is still often several times faster for decoding strings compared to the old ArrayReader
, this hasn't changed much.
It's probably possible to make this more efficient, but it would require more thinking and more time for a bigger change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense -- thank you for running the numbers
Thank you for this 🎉. I will take a look today if I have time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes sense to me, and seems like a pragmatic quick fix for the bug 👍
@@ -398,19 +416,37 @@ impl<'a, C: ArrayConverter + 'a> ArrowArrayReader<'a, C> { | |||
offset, | |||
def_levels_byte_len, | |||
); | |||
let value_count = Self::count_def_level_values( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to do num_values - num_nulls
here as the DataPageV2 has this information. Unfortunately very few implementations in practice seem to produce such pages, and tbh I'm not entirely sure if num_nulls is what I think it is...
Thanks again @yordan-pavlov |
Which issue does this PR close?
Closes #1111.
Rationale for this change
As explained in #1111
RleDecoder
as used inVariableLenDictionaryDecoder
as part of the implementation ofArrowArrayReader
, incorrectly returns more keys than are actually available while at the same time, when the page contains NULLsVariableLenDictionaryDecoder
is also requesting more keys than available becausenum_values
is inclusive of NULLs. This then results in incorrectly decoding a dictionary-encoded page which also contains NULLs and returning more values than necessary.What changes are included in this PR?
This PR contains:
num_values
from the data page) when creating the value decoder, so that it knows how many values are actually available. This is then used in existing code inVariableLenDictionaryDecoder
to limit how many keys are requested from the nestedRleDecoder
.test_arrow_array_reader_dict_enc_string
forArrowArrayReader
test_complex_array_reader_dict_enc_string
forArrayReader
Are there any user-facing changes?
No
@alamb @tustvold