GH-38462: [Go][Parquet] Handle Boolean RLE encoding/decoding #38367
Conversation
(I think that we don't need to rush to implement this for 14.0.0 because it's already feature frozen.)
Co-authored-by: Sutou Kouhei <[email protected]>
Should there be a higher-level roundtrip test?
Also, can you add a file read test from `data/rle_boolean_encoding.parquet`?
batch := shared_utils.MinInt(len(buf), n)
decoded := dec.rleDec.GetBatch(buf[:batch])
if decoded != batch {
return max - n, io.ErrUnexpectedEOF
Is there a point in returning `max - n` here? What is the caller supposed to do with it?
(note you aren't actually writing out the decoded values, so I assume the decoded bytes are lost, which means the decoder can't be used anymore afterwards)
I guess we may hit this case if a user uses the `Decoder` directly.
However, during Arrow reading, I guess the upper-level reader keeps the non-null value count and wouldn't hit this branch here?
We are writing out the decoded values via the loop at line 165. The input to this function is an output slice to write to, and we populate it after the call to `GetBatch`. So returning `max - n` here informs the caller of how many values were populated into that slice before the error was hit.
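As an aside, here is a minimal sketch of the decode pattern being described: the caller supplies an output slice, the decoder fills it batch by batch, and on a short read the count of values already written is returned alongside the error. The names below (`rleDecoder`, `decodeBools`, `minInt`) are hypothetical stand-ins, not the actual arrow/parquet code.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

type rleDecoder struct{ vals []bool } // stand-in for the real RLE decoder state

// GetBatch copies up to len(out) decoded values into out and reports how many
// it produced; fewer than requested means the encoded input ran out early.
func (d *rleDecoder) GetBatch(out []bool) int {
	n := copy(out, d.vals)
	d.vals = d.vals[n:]
	return n
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// decodeBools fills out with up to max values, returning how many were written
// even when it stops early with io.ErrUnexpectedEOF.
func decodeBools(dec *rleDecoder, out []bool, max int) (int, error) {
	buf := make([]bool, 4) // small scratch buffer to force multiple batches
	n := max
	for n > 0 {
		batch := minInt(len(buf), n)
		decoded := dec.GetBatch(buf[:batch])
		if decoded != batch {
			return max - n, io.ErrUnexpectedEOF // partial count, not zero
		}
		copy(out[max-n:], buf[:decoded]) // populate the output slice after GetBatch
		n -= decoded
	}
	return max, nil
}

func main() {
	dec := &rleDecoder{vals: []bool{true, false, true, true, false}}
	out := make([]bool, 8)
	got, err := decodeBools(dec, out, 8)
	if errors.Is(err, io.ErrUnexpectedEOF) {
		fmt.Println("values decoded before the error:", out[:got])
	}
}
```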
But `dec.rleDec.GetBatch` has consumed some input and decoded some bytes that are just lost in `buf`, right?
Correct. After receiving an error, a given decoder should not continue to be used, but we return the number of successfully output values so that a caller knows what values are there before the error was encountered.
This wouldn't matter without using the raw decode API, because `RecordReader` will handle `Decode` well. But it still makes the syntax a bit inconsistent.
Hmm, you have a point, though I think that would be a larger issue to think about attempting to fix.
For `DataPageV2Header`, which contains the number of nulls, we can easily handle changing `dec.nvals` to the correct number, but for `DataPageV1` you have a point that technically this might be slightly off, or it's at least a bug waiting to happen that should be addressed.
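For context, a simplified paraphrase of the two Parquet data page headers from parquet-format: the V2 header carries an explicit null count, while the V1 header only reports the total value count, so nulls must be derived from the definition levels. The structs below are illustrative, not the generated Thrift code.

```go
package main

import "fmt"

type dataPageHeaderV1 struct {
	NumValues int32 // total values including nulls; no separate null count
	// Encoding, DefinitionLevelEncoding, RepetitionLevelEncoding, Statistics ...
}

type dataPageHeaderV2 struct {
	NumValues int32 // total values including nulls
	NumNulls  int32 // explicit null count, known before decoding any levels
	NumRows   int32
	// Encoding, DefinitionLevelsByteLength, RepetitionLevelsByteLength, ...
}

func main() {
	hdr := dataPageHeaderV2{NumValues: 100, NumNulls: 25, NumRows: 100}
	// With V2, the decoder's expected non-null count can be set up front.
	fmt.Println("non-null values:", hdr.NumValues-hdr.NumNulls)
	// With V1, the same number is only known after the definition levels
	// have been decoded.
	_ = dataPageHeaderV1{NumValues: 100}
}
```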
Yeah, let's create an issue for that. I think users would not easily touch it because we always suggest using `RecordReader`, right?
or going through the ColumnChunkReader which has similar correct handling. We don't actually expose the raw decoder api for users to access at all.
Aha that's right.
return 0, err
}
if valuesRead != toRead {
return valuesRead, xerrors.New("parquet: rle boolean decoder: number of values / definition levels read did not match")
Hmm, what is the rationale for using `xerrors` here rather than the Go stdlib as in other places?
At the time I originally wrote the parquet code, `xerrors` was used for particular benefits of wrapping and otherwise that weren't available in the stdlib. Since then, all of the features of `xerrors` have been folded into the Go stdlib and there really isn't a reason to use it anymore; I intend to phase it out as I make changes to the code. So I'm going to fix this to just use `errors`; it was my mistake to propagate `xerrors` here, so thanks for catching it.
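For reference, the stdlib now covers the `xerrors` features mentioned above. A small sketch (the error message and the wrapped error here are made up for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// errors.New replaces xerrors.New for plain error values.
	errMismatch := errors.New("parquet: rle boolean decoder: number of values / definition levels read did not match")

	// fmt.Errorf with %w replaces xerrors.Errorf for wrapping.
	wrapped := fmt.Errorf("reading column data: %w", io.ErrUnexpectedEOF)

	// errors.Is / errors.As replace xerrors.Is / xerrors.As for inspection.
	fmt.Println(errors.Is(wrapped, io.ErrUnexpectedEOF)) // true
	fmt.Println(errMismatch)
}
```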
for n > 0 {
batch := shared_utils.MinInt(len(buf), n)
decoded := dec.rleDec.GetBatch(buf[:batch])
if decoded != batch {
Should `dec.nvals` be decremented in this branch?
In theory the return of the error would indicate that the decoder should not be used any further, so I don't know if we necessarily need to decrement `dec.nvals`, but it also wouldn't be wrong to do so.
@zeroshade Could you open a new issue for boolean RLE support?
@pitrou I don't think we need a higher-level roundtrip test than the tests we currently have and use for this. As for an individual file, we've confirmed that the updated
I'd like to get this in relatively soon, as all CI for Go that runs the parquet tests is going to fail on the main branch until this is merged.
Yes.
@pitrou added the test for the rle_boolean_encoding.parquet file
LGTM besides the syntax here: https://github.com/apache/arrow/pull/38367/files#r1375773113 Don't know if this is expected.
LGTM. Just one testing comment, but you may ignore it if you prefer.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 23b62a4. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
…pache#38367) ### Rationale for this change Looks like the parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. This causes the Go parquet lib to fail verification tests when it pulls the most recent commits for the parquet-testing repository. So a solution for this is to actually implement the RleBoolean encoder and decoder. ### What changes are included in this PR? Adding `RleBooleanEncoder` and `RleBooleanDecoder` and updating the `parquet-testing` repo. ### Are these changes tested? Unit tests are added, and this is also tested via the `parquet-testing` golden files. * Closes: apache#38345 * Closes: apache#38462 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Rationale for this change
Looks like the parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. This causes the Go parquet lib to fail verification tests when it pulls the most recent commits for the parquet-testing repository. So a solution for this is to actually implement the RleBoolean encoder and decoder.
What changes are included in this PR?
Adding `RleBooleanEncoder` and `RleBooleanDecoder` and updating the `parquet-testing` repo.
Are these changes tested?
Unit tests are added, and this is also tested via the `parquet-testing` golden files.
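For background, the encoding added here is the Parquet hybrid RLE/bit-packed format applied to boolean values with a bit width of 1, with the encoded bytes prefixed by their length. The sketch below illustrates that wire format using only RLE runs; it is not the library's `RleBooleanEncoder` (a real encoder can also emit bit-packed runs for short, alternating sequences), and the function name is made up for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBoolRLERuns writes boolean values as pure RLE runs in the Parquet
// hybrid RLE/bit-packed format with bit width 1: each run is a varint header
// of count<<1 (LSB 0 marks an RLE run) followed by the repeated value stored
// in one byte, and the whole payload is prefixed by its byte length as a
// 4-byte little-endian uint32 when used for BOOLEAN data pages.
func encodeBoolRLERuns(values []bool) []byte {
	var out []byte
	for i := 0; i < len(values); {
		j := i
		for j < len(values) && values[j] == values[i] {
			j++
		}
		runLen := uint64(j - i)
		var hdr [binary.MaxVarintLen64]byte
		n := binary.PutUvarint(hdr[:], runLen<<1) // LSB 0 marks an RLE run
		out = append(out, hdr[:n]...)
		v := byte(0)
		if values[i] {
			v = 1
		}
		out = append(out, v) // run value, ceil(1/8) = 1 byte for booleans
		i = j
	}
	prefixed := make([]byte, 4, 4+len(out))
	binary.LittleEndian.PutUint32(prefixed, uint32(len(out)))
	return append(prefixed, out...)
}

func main() {
	// Three runs: 3 x true, 2 x false, 1 x true.
	fmt.Printf("% x\n", encodeBoolRLERuns([]bool{true, true, true, false, false, true}))
}
```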