GH-38462: [Go][Parquet] Handle Boolean RLE encoding/decoding #38367

zeroshade · 2023-10-19T21:48:30Z

Rationale for this change

The parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. As a result, the Go parquet lib fails verification tests when it pulls the most recent commits for the parquet-testing repository. The fix is to actually implement the RleBoolean encoder and decoder.

What changes are included in this PR?

Adding RleBooleanEncoder and RleBooleanDecoder and updating the parquet-testing repo.
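
For readers unfamiliar with the format, here is a minimal sketch of what decoding RLE-encoded booleans involves. It is not the code from this PR: `decodeRLEBooleans` is a hypothetical helper, and the layout it assumes (a 4-byte little-endian length prefix followed by RLE/bit-packed hybrid runs with a bit width of 1) is my reading of the Parquet encoding spec.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// decodeRLEBooleans is an illustrative decoder, not the PR's implementation.
// Assumed layout: 4-byte little-endian length, then hybrid runs of bit width 1.
// It returns up to n values, plus an error if the input ends early.
func decodeRLEBooleans(data []byte, n int) ([]bool, error) {
	if len(data) < 4 {
		return nil, errors.New("missing length prefix")
	}
	size := uint64(binary.LittleEndian.Uint32(data))
	data = data[4:]
	if size > uint64(len(data)) {
		return nil, errors.New("length prefix exceeds buffer")
	}
	data = data[:size]

	out := make([]bool, 0, n)
	for len(out) < n {
		// Each run starts with a ULEB128 header; the low bit selects the run type.
		header, read := binary.Uvarint(data)
		if read <= 0 {
			return out, errors.New("truncated run header")
		}
		data = data[read:]
		count := header >> 1

		if header&1 == 0 {
			// RLE run: a single byte holds the value repeated count times.
			if len(data) < 1 {
				return out, errors.New("truncated RLE run")
			}
			v := data[0]&1 == 1
			data = data[1:]
			for i := uint64(0); i < count && len(out) < n; i++ {
				out = append(out, v)
			}
			continue
		}

		// Bit-packed run: count groups of 8 values, one byte per group, LSB first.
		if count > uint64(len(data)) {
			return out, errors.New("truncated bit-packed run")
		}
		for _, b := range data[:count] {
			for bit := 0; bit < 8 && len(out) < n; bit++ {
				out = append(out, (b>>bit)&1 == 1)
			}
		}
		data = data[count:]
	}
	return out, nil
}

func main() {
	// Length prefix (4 bytes of payload), an RLE run of five trues (0x0A, 0x01),
	// then one bit-packed group (header 0x03) holding the bits 0b00000101.
	page := []byte{0x04, 0x00, 0x00, 0x00, 0x0A, 0x01, 0x03, 0x05}
	vals, err := decodeRLEBooleans(page, 8)
	fmt.Println(vals, err)
}
```

The sketch stops as soon as `n` values have been produced, so padding bits in the last bit-packed group are ignored.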

Are these changes tested?

Unit tests are added, and this is also tested via the parquet-testing golden files.

kou

(I don't think we need to rush to implement this for 14.0.0 because it's already in feature freeze.)

go/parquet/internal/encoding/boolean_decoder.go

go/parquet/internal/encoding/boolean_encoder.go

Co-authored-by: Sutou Kouhei <[email protected]>

pitrou

Should there be a higher-level roundtrip test?

Also, can you add a file read test from data/rle_boolean_encoding.parquet?

go/parquet/internal/encoding/encoding_test.go

go/parquet/internal/encoding/boolean_encoder.go

pitrou · 2023-10-23T13:22:45Z

go/parquet/internal/encoding/boolean_decoder.go

+		batch := shared_utils.MinInt(len(buf), n)
+		decoded := dec.rleDec.GetBatch(buf[:batch])
+		if decoded != batch {
+			return max - n, io.ErrUnexpectedEOF


Is there a point in returning max - n here? What is the caller supposed to do with it?
(note you aren't actually writing out the decoded values, so I assume the decoded bytes are lost, which means the decoder can't be used anymore afterwards)

I guess we may hit this case if a user uses the Decoder directly.

However, when Arrow reads it, I guess the upper reader keeps the non-null value count and doesn't hit this branch?

We are writing out the decoded values via the loop at line 165. The input to this function is an output slice to write to and we populate it after the call to GetBatch. So returning max - n here informs the caller of how many values were populated into that slice before the error was hit.

But dec.rleDec.GetBatch has consumed some input and decoded some bytes that are just lost in buf, right?

Correct. After receiving an error, a given decoder should not continue to be used, but we return the number of successfully output values so that a caller knows what values are there before the error was encountered.
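
The contract described in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `batchSource` is a hypothetical stand-in for the RLE decoder, and the scratch buffer size of 4 is arbitrary.

```go
package main

import (
	"fmt"
	"io"
)

// batchSource is a hypothetical stand-in for the underlying RLE decoder.
// GetBatch fills buf and reports how many values it actually produced.
type batchSource struct{ remaining []bool }

func (s *batchSource) GetBatch(buf []bool) int {
	n := copy(buf, s.remaining)
	s.remaining = s.remaining[n:]
	return n
}

// decode fills out in scratch-sized batches. On a short batch it returns the
// number of values already written to out, together with io.ErrUnexpectedEOF.
func decode(src *batchSource, out []bool) (int, error) {
	max := len(out)
	n := max // values still to be produced
	var scratch [4]bool
	for n > 0 {
		batch := len(scratch)
		if n < batch {
			batch = n
		}
		decoded := src.GetBatch(scratch[:batch])
		if decoded != batch {
			// The partially decoded batch stays in scratch and is dropped,
			// so the source must not be reused after this error.
			return max - n, io.ErrUnexpectedEOF
		}
		copy(out[max-n:], scratch[:batch])
		n -= batch
	}
	return max, nil
}

func main() {
	// Only 6 values are available but 10 are requested.
	src := &batchSource{remaining: make([]bool, 6)}
	out := make([]bool, 10)
	n, err := decode(src, out)
	fmt.Println(n, err)
}
```

In this sketch the caller learns that the first full batch landed in `out`, while the two values consumed in the short batch are lost, which is the situation discussed above.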

This wouldn't matter without using the raw decode API, because RecordReader handles Decode well. But it still makes the syntax a bit inconsistent.

Hmm, you have a point though I think that would be a greater issue to think about attempting to fix.

DataPageV2Header contains the number of nulls, so we can easily change dec.nvals to the correct number. For DataPageV1, though, you have a point: technically this might be slightly off, or is at least a bug waiting to happen that should be addressed.
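
As a rough illustration of the DataPageV1 case: since that header carries no null count, a reader can derive the number of stored values from the definition levels, counting the entries equal to the column's maximum definition level. The helper below is hypothetical and only sketches that idea; it is not taken from the Go parquet library.

```go
package main

import "fmt"

// countNonNull is a hypothetical helper: a value is physically present in the
// page only when its definition level equals the column's max definition level.
func countNonNull(defLevels []int16, maxDefLevel int16) int {
	count := 0
	for _, lvl := range defLevels {
		if lvl == maxDefLevel {
			count++
		}
	}
	return count
}

func main() {
	// Five rows of an optional column (max definition level 1), two of them null.
	defLevels := []int16{1, 0, 1, 1, 0}
	fmt.Println(countNonNull(defLevels, 1))
}
```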

Yeah, let's create an issue for that. I think users would not easily touch it because we always suggest using RecordReader, right?

or going through the ColumnChunkReader which has similar correct handling. We don't actually expose the raw decoder api for users to access at all.

Aha that's right.

go/parquet/internal/encoding/boolean_decoder.go

pitrou · 2023-10-23T13:27:27Z

go/parquet/internal/encoding/boolean_decoder.go

+			return 0, err
+		}
+		if valuesRead != toRead {
+			return valuesRead, xerrors.New("parquet: rle boolean decoder: number of values / definition levels read did not match")


Hmm, what is the rationale for using xerrors here rather than the go stdlib as in other places?

At the time I originally wrote the parquet code, xerrors was used for particular benefits, such as wrapping, that weren't available in the stdlib. Since then, all the features of xerrors have been folded into the Go stdlib and there really isn't a reason to use it anymore, so I intend to phase it out as I make changes to the code. I'm going to fix this to just use errors. It was my mistake to propagate xerrors here, so thanks for catching it.
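
For context, the standard library covers the same ground with `errors.New`, `fmt.Errorf` with the `%w` verb, and `errors.Is`. The sketch below is illustrative only: `errCountMismatch`, `checkRead`, and `isCountMismatch` are hypothetical names, and only the message text comes from the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// errCountMismatch is a hypothetical sentinel built with the stdlib errors package.
var errCountMismatch = errors.New("parquet: rle boolean decoder: number of values / definition levels read did not match")

// checkRead wraps the sentinel with context using %w, so callers can still
// match on it with errors.Is.
func checkRead(valuesRead, toRead int) error {
	if valuesRead != toRead {
		return fmt.Errorf("read %d of %d values: %w", valuesRead, toRead, errCountMismatch)
	}
	return nil
}

// isCountMismatch reports whether err wraps the sentinel anywhere in its chain.
func isCountMismatch(err error) bool {
	return errors.Is(err, errCountMismatch)
}

func main() {
	err := checkRead(3, 5)
	fmt.Println(err)
	fmt.Println(isCountMismatch(err))
}
```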

go/parquet/internal/encoding/boolean_decoder.go

mapleFU · 2023-10-23T13:44:24Z

go/parquet/internal/encoding/boolean_decoder.go

+		batch := shared_utils.MinInt(len(buf), n)
+		decoded := dec.rleDec.GetBatch(buf[:batch])
+		if decoded != batch {
+			return max - n, io.ErrUnexpectedEOF


I guess we may hit this case if a user uses the Decoder directly.

However, when Arrow reads it, I guess the upper reader keeps the non-null value count and doesn't hit this branch?

mapleFU · 2023-10-23T13:45:14Z

go/parquet/internal/encoding/boolean_decoder.go

+	for n > 0 {
+		batch := shared_utils.MinInt(len(buf), n)
+		decoded := dec.rleDec.GetBatch(buf[:batch])
+		if decoded != batch {


Should dec.nvals be decremented in this branch?

In theory, returning the error indicates that the decoder should not be used any further, so I don't know if we necessarily need to decrement dec.nvals, but it also wouldn't be wrong to do so.

kou · 2023-10-25T01:57:08Z

@zeroshade Could you open a new issue for boolean RLE support?

github-actions · 2023-10-25T15:46:39Z

⚠️ GitHub issue #38462 has been automatically assigned in GitHub to PR creator.

zeroshade · 2023-10-26T19:49:55Z

@pitrou I don't think we need a higher level roundtrip test than the tests we currently have and use for this.

As for an individual file, we've confirmed that the updated parquet-testing files use the RLE encoding for boolean columns, so I believe I need to add a separate test for that particular file. Thoughts?

zeroshade · 2023-10-26T19:56:03Z

I'd like to get this in relatively soon as all CI for go that runs the parquet tests is going to fail on the main branch until this is merged

pitrou · 2023-10-26T19:57:06Z

As for an individual file, we've confirmed that the updated parquet-testing files use the RLE encoding for boolean columns, so I believe I need to add a separate test for that particular file.

Yes.

zeroshade · 2023-10-26T20:47:13Z

@pitrou added the test for the rle_boolean_encoding.parquet file

mapleFU · 2023-10-30T07:35:45Z

LGTM beside the syntax here: https://github.com/apache/arrow/pull/38367/files#r1375773113

Don't know if this is expected.

go/parquet/file/file_reader_test.go

pitrou

LGTM. Just one testing but you may ignore it if you prefer.

conbench-apache-arrow · 2023-10-30T20:14:31Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 23b62a4.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…pache#38367)

### Rationale for this change

Looks like the parquet-testing repo files have been updated and now include boolean columns which use the RLE encoding type. This causes the Go parquet lib to fail verification tests when it pulls the most recent commits for the parquet-testing repository. So a solution for this is to actually implement the RleBoolean encoder and decoder.

### What changes are included in this PR?

Adding `RleBooleanEncoder` and `RleBooleanDecoder` and updating the `parquet-testing` repo.

### Are these changes tested?

Unit tests are added, and this is also tested via the `parquet-testing` golden files.

* Closes: apache#38345
* Closes: apache#38462

Lead-authored-by: Matt Topol <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Matt Topol <[email protected]>

apacheGH-38345: [Go][Parquet] Handle Boolean RLE encoding/decoding

c121de0

zeroshade requested review from kou, lidavidm, raulcd and pitrou October 19, 2023 21:48

github-actions bot added Component: Go Component: C++ awaiting committer review Awaiting committer review labels Oct 19, 2023

kou reviewed Oct 20, 2023

go/parquet/internal/encoding/boolean_decoder.go

go/parquet/internal/encoding/boolean_decoder.go (outdated)

go/parquet/internal/encoding/boolean_encoder.go (outdated)

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Oct 20, 2023

Update go/parquet/internal/encoding/boolean_decoder.go

e1ac231

Co-authored-by: Sutou Kouhei <[email protected]>

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 20, 2023

pitrou requested changes Oct 23, 2023

mapleFU reviewed Oct 23, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 23, 2023

updates from feedback

a7bfc0f

github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 23, 2023

zeroshade changed the title ~~GH-38345: [Go][Parquet] Handle Boolean RLE encoding/decoding~~ GH-38462: [Go][Parquet] Handle Boolean RLE encoding/decoding Oct 25, 2023

Rename Buffer size funcs

80d1eef

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 25, 2023

github-actions bot added the awaiting changes Awaiting changes label Oct 25, 2023

add rle boolean round trip spaced

0d86243

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 25, 2023

add rle_boolean_encoding file test

b9fad93

zeroshade requested review from pitrou and mapleFU October 26, 2023 21:43

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 30, 2023

pitrou reviewed Oct 30, 2023

go/parquet/file/file_reader_test.go (outdated)

pitrou approved these changes Oct 30, 2023

assert defLvls are 1

cd2fc77

zeroshade merged commit 23b62a4 into apache:main Oct 30, 2023

zeroshade removed the awaiting changes Awaiting changes label Oct 30, 2023

zeroshade deleted the rle-boolean-encoding branch October 30, 2023 15:12

github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Oct 30, 2023
