Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38462: [Go][Parquet] Handle Boolean RLE encoding/decoding #38367

Merged
merged 7 commits into from
Oct 30, 2023

Conversation

zeroshade
Copy link
Member

@zeroshade zeroshade commented Oct 19, 2023

Rationale for this change

Looks like the parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. This causes the Go parquet lib to fail verification tests when it pulls the most recent commits for the parquet-testing repository. So a solution for this is to actually implement the RleBoolean encoder and decoder.

What changes are included in this PR?

Adding RleBooleanEncoder and RleBooleanDecoder and updating the parquet-testing repo.

Are these changes tested?

Unit tests are added, and this is also tested via the parquet-testing golden files.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think that we don't need to rush to implement this for 14.0.0 because it's already feature freezed.)

go/parquet/internal/encoding/boolean_decoder.go Outdated Show resolved Hide resolved
go/parquet/internal/encoding/boolean_encoder.go Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Oct 20, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 20, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a higher-level roundtrip test?

Also, can you add a file read test from data/rle_boolean_encoding.parquet?

go/parquet/internal/encoding/encoding_test.go Show resolved Hide resolved
go/parquet/internal/encoding/boolean_encoder.go Outdated Show resolved Hide resolved
go/parquet/internal/encoding/boolean_encoder.go Outdated Show resolved Hide resolved
go/parquet/internal/encoding/boolean_encoder.go Outdated Show resolved Hide resolved
go/parquet/internal/encoding/boolean_encoder.go Outdated Show resolved Hide resolved
batch := shared_utils.MinInt(len(buf), n)
decoded := dec.rleDec.GetBatch(buf[:batch])
if decoded != batch {
return max - n, io.ErrUnexpectedEOF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a point in returning max - n here? What is the caller supposed to do with it?
(note you aren't actually writing out the decoded values, so I assume the decoded bytes are lost, which means the decoder can't be used anymore afterwards)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we may meet the case if user use Decoder directly.

However, during arrow reading it, I guess the upper reader will keep non-null value count and didn't hit the branch here?

Copy link
Member Author

@zeroshade zeroshade Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are writing out the decoded values via the loop at line 165. The input to this function is an output slice to write to and we populate it after the call to GetBatch. So returning max - n here informs the caller of how many values were populated into that slice before the error was hit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But dec.rleDec.GetBatch has consumed some input and decoded some bytes that are just lost in buf, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. After receiving an error, a given decoder should not continue to be used, but we return the number of successfully output values so that a caller knows what values are there before the error was encountered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wouldn't matter without using raw-decode api. Because RecordReader will handing Decode well. But it's still make the syntax a bit inconsistent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you have a point though I think that would be a greater issue to think about attempting to fix.

For DataPageV2Header it contains the number of nulls so we can easily handle changing dec.nvals to the correct number, but for DataPageV1 you have a point that technically this might be slightly off or is at least a bug waiting to happen that should be addressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, lets create an issue for that. I think user would not easily touch it because we always suggest using RecordReader, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or going through the ColumnChunkReader which has similar correct handling. We don't actually expose the raw decoder api for users to access at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha that's right.

return 0, err
}
if valuesRead != toRead {
return valuesRead, xerrors.New("parquet: rle boolean decoder: number of values / definition levels read did not match")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what is the rationale for using xerrors here rather than the go stdlib as in other places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the time I originally wrote the parquet code, xerrors was used for particular benefits of wrapping and otherwise that weren't available in the stdlib. Since then, all the features of xerrors have been folded into the go stdlib and there really isn't a reason to use it anymore, I intend to phase it out as I make changes to the code. So I'm going to fix this to just use errors, it was my mistake to propagate xerrors here so thanks for catching it.

batch := shared_utils.MinInt(len(buf), n)
decoded := dec.rleDec.GetBatch(buf[:batch])
if decoded != batch {
return max - n, io.ErrUnexpectedEOF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we may meet the case if user use Decoder directly.

However, during arrow reading it, I guess the upper reader will keep non-null value count and didn't hit the branch here?

for n > 0 {
batch := shared_utils.MinInt(len(buf), n)
decoded := dec.rleDec.GetBatch(buf[:batch])
if decoded != batch {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should dec.nvals dec in this branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory the return of the error would indicate that the decoder should not be used any further, so I don't know if we necessarily need to decode dec.nvals but it also wouldn't be wrong to do so.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 23, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 23, 2023
@kou
Copy link
Member

kou commented Oct 25, 2023

@zeroshade Could you open a new issue for boolean RLE support?

@zeroshade zeroshade changed the title GH-38345: [Go][Parquet] Handle Boolean RLE encoding/decoding GH-38462: [Go][Parquet] Handle Boolean RLE encoding/decoding Oct 25, 2023
@github-actions
Copy link

⚠️ GitHub issue #38462 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 25, 2023
@github-actions github-actions bot added the awaiting changes Awaiting changes label Oct 25, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 25, 2023
@zeroshade
Copy link
Member Author

@pitrou I don't think we need a higher level roundtrip test than the tests we currently have and use for this.

As for an individual file, we've confirmed that the updated parquet-testing files use the RLE encoding for boolean columns, so I believe i need to add a separate test for that particular file. Thoughts?

@zeroshade
Copy link
Member Author

I'd like to get this in relatively soon as all CI for go that runs the parquet tests is going to fail on the main branch until this is merged

@pitrou
Copy link
Member

pitrou commented Oct 26, 2023

As for an individual file, we've confirmed that the updated parquet-testing files use the RLE encoding for boolean columns, so I believe i need to add a separate test for that particular file.

Yes.

@zeroshade
Copy link
Member Author

@pitrou added the test for the rle_boolean_encoding.parquet file

@zeroshade zeroshade requested review from pitrou and mapleFU October 26, 2023 21:43
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2023

LGTM beside the syntax here: https://github.com/apache/arrow/pull/38367/files#r1375773113

Don't know if this is expected.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 30, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just one testing but you may ignore it if you prefer.

@zeroshade zeroshade merged commit 23b62a4 into apache:main Oct 30, 2023
41 of 45 checks passed
@zeroshade zeroshade removed the awaiting changes Awaiting changes label Oct 30, 2023
@zeroshade zeroshade deleted the rle-boolean-encoding branch October 30, 2023 15:12
@github-actions github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Oct 30, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 23b62a4.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#38367)

### Rationale for this change
Looks like the parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. This causes the Go parquet lib to fail verification tests when it pulls the most recent commits for the parquet-testing repository. So a solution for this is to actually implement the RleBoolean encoder and decoder.

### What changes are included in this PR?
Adding `RleBooleanEncoder` and `RleBooleanDecoder` and updating the `parquet-testing` repo.

### Are these changes tested?
Unit tests are added, and this is also tested via the `parquet-testing` golden files.

* Closes: apache#38345
* Closes: apache#38462

Lead-authored-by: Matt Topol <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#38367)

### Rationale for this change
Looks like the parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. This causes the Go parquet lib to fail verification tests when it pulls the most recent commits for the parquet-testing repository. So a solution for this is to actually implement the RleBoolean encoder and decoder.

### What changes are included in this PR?
Adding `RleBooleanEncoder` and `RleBooleanDecoder` and updating the `parquet-testing` repo.

### Are these changes tested?
Unit tests are added, and this is also tested via the `parquet-testing` golden files.

* Closes: apache#38345
* Closes: apache#38462

Lead-authored-by: Matt Topol <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants