ARROW-12104: [Go][Parquet] Second chunk of Ported Go Parquet code #9817

Closed
zeroshade wants to merge 14 commits into master from zeroshade:arrow-12104

Conversation

@zeroshade (Member) commented Mar 26, 2021:

Following up from #9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

@WilliamWhispell (Contributor) left a comment:
LGTM

	sofar += n
}
if err != nil && err != io.EOF {
	panic(err)

Contributor:
Why panic and not return an error?

Member Author:
It simplifies the Encode/Decode interface, and it isn't recoverable if it fails anyway.

func GetCodec(typ Compression) Codec {
	ret, ok := codecs[typ]
	if !ok {
		// return codecs[Codecs.Uncompressed]

Contributor:
Does this require further thought on defaults, or should the comment go?

Member Author:
For now I liked the idea of erroring when trying to retrieve a codec we haven't implemented, rather than silently returning the uncompressed one.

The alternative here to panicking would be to change this to return (Codec, error) and have it return nil and an unimplemented error if it can't find the desired codec.

Contributor:
I think we should use panics only when absolutely necessary. If a codec can't be found, the way to do that in Go is to return an error. This is how I implemented it:

https://github.com/nickpoorman/arrow/blob/ARROW-7905-go-parquet/go/parquet/parquet/compress/compress.go#L31

Member Author:
@nickpoorman

So, between just defaulting to returning Uncompressed vs. modifying this to return (Codec, error) with an unimplemented error, what do you think?

Member Author:
I've updated this to return (Codec, error) now instead of panicking.
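
For reference, a minimal sketch of what the updated shape could look like, assuming the codecs map and the Compression/Codec types from the fragments above; the error text and the use of fmt are illustrative, not the actual implementation:

```go
// GetCodec looks up a registered codec; an unknown or unimplemented
// codec now produces an error instead of a panic.
func GetCodec(typ Compression) (Codec, error) {
	ret, ok := codecs[typ]
	if !ok {
		return nil, fmt.Errorf("compress: codec for %v not implemented", typ)
	}
	return ret, nil
}
```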

Contributor:
👍 I think returning the error is the right move

go/parquet/compress/compress_test.go (outdated; resolved)

wr := codec.NewWriter(&buf)

const chunkSize = 1111

Contributor:
why 1111?

Member Author:
I pulled this test from the C++ implementation's tests. Ultimately it's a number small enough to make sure we'll have multiple chunks, but large enough that there's some compression it can do :)

Contributor:
I'm surprised C++ doesn't use a power of 2 (1024?)

Member Author:
I think the intent was to ensure there's a chunk at the end which is not a full chunk, to make sure we test that handling properly.
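
To illustrate, a hedged sketch of the kind of chunked write loop such a test performs, reusing wr and chunkSize from the fragment above; the 25000-byte input size and the makeCompressibleData helper are assumptions for illustration, not the actual test code:

```go
// With chunkSize = 1111 and 25000 bytes of input, 25000 % 1111 == 558,
// so the final Write receives a short, non-full chunk.
data := makeCompressibleData(25000) // hypothetical helper producing test input
for offset := 0; offset < len(data); offset += chunkSize {
	end := offset + chunkSize
	if end > len(data) {
		end = len(data) // the last, partial chunk
	}
	if _, err := wr.Write(data[offset:end]); err != nil {
		t.Fatal(err)
	}
}
if err := wr.Close(); err != nil {
	t.Fatal(err)
}
```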

func (nocodec) CompressBound(len int64) int64 { return len }

func init() {
	codecs[Codecs.Uncompressed] = nocodec{}

Contributor:
I like the name, but any thoughts on Codecs.Identity - for example, https://grpc.github.io/grpc-java/javadoc/io/grpc/Codec.Identity.html

Member Author:
I think I prefer leaving the name as Uncompressed.

@emkornfield (Contributor):

sorry this week is particularly bad. I will try to review on the weekend/next week.

@emkornfield (Contributor):

I removed the "@" mentions in the description. It appears I get notified every time someone clones the commit in master. Please tag people in a first comment instead.

@zeroshade (Member Author):

@emkornfield Did not realize that, will keep that in mind for future PRs. Sorry!

@zeroshade (Member Author):

rebased from master

@zeroshade (Member Author):

@emkornfield @sbinet @nickpoorman Bump on getting reviews here! thanks!

@emkornfield (Contributor):

will try to do a first pass today or tomorrow.

@emkornfield (Contributor) left a comment:
Took a pass through and it looks reasonable. I'm not too familiar with the encryption stuff. Tagging @ggershinsky to see if he wants to take a look.

go/parquet/compress/brotli.go (outdated; resolved)
go/parquet/compress/brotli.go (resolved)
go/parquet/compress/compress.go (resolved)

	// NewWriter provides a wrapper around a write stream to compress data before writing it.
	NewWriter(io.Writer) io.WriteCloser
	// NewWriterLevel is like NewWriter but allows specifying the compression level.
	NewWriterLevel(io.Writer, int) (io.WriteCloser, error)

Contributor:
API comment: is it worth exposing the compression level in the API? An alternative would have been a member set at construction time.

Member Author:
So the way I built this was to minimize object creation: the map contains stateless objects to ensure we have no race conditions, and those objects know how to construct the appropriate encoders and decoders. That's why I exposed it like this: not all of the underlying libraries allow configuring the compression level after constructing the encoder. Most of them take the compression level as an argument to their NewWriter functions etc., which is what I modeled this after.
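
As a hedged usage sketch of that design (the import path, the Codecs.Gzip lookup, and the level value are assumptions for illustration, not the library's confirmed API):

```go
package main

import (
	"bytes"
	"log"

	"github.com/apache/arrow/go/parquet/compress" // import path assumed
)

func main() {
	var buf bytes.Buffer

	// look up the stateless codec object once; it is safe to share
	codec, err := compress.GetCodec(compress.Codecs.Gzip)
	if err != nil {
		log.Fatal(err)
	}

	// the compression level is supplied at writer construction, mirroring the
	// NewWriter-style constructors of the underlying compression libraries
	wr, err := codec.NewWriterLevel(&buf, 6)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := wr.Write([]byte("some data to compress")); err != nil {
		log.Fatal(err)
	}
	if err := wr.Close(); err != nil {
		log.Fatal(err)
	}
	log.Printf("compressed to %d bytes", buf.Len())
}
```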

go/parquet/compress/compress.go (outdated; resolved)
go/parquet/compress/zstd.go (outdated; resolved)
go/parquet/reader_properties.go (resolved)

	Required  Repetition
	Optional  Repetition
	Repeated  Repetition
	Undefined Repetition // convenience value

Contributor:
It might pay to distinguish between undefined and not set. An issue was raised recently that, at least in C++, we write Repetition required for the root of the schema when, according to the spec, we probably shouldn't.

Member Author:
In this case Undefined == not set; it exists explicitly to have an available option for "not set", since "undefined" is not a legitimate value in the Parquet spec. If you'd prefer, I can change the name to NotSet rather than Undefined.

}

// WithMaxRowGroupLength specifies the maximum number of rows for a given row group in the writer.
func WithMaxRowGroupLength(nrows int64) WriterProperty {

Contributor:
Not sure if this was mirrored from the C++ implementation intentionally; just wanted to check that this pattern is idiomatic Go?

Member Author:
The C++ implementation uses a Builder pattern that constructs an object with a bunch of "Set...." functions on it and then eventually calls "Build" to generate the resulting object. This follows the idiomatic Go pattern for taking options, which is the same pattern used in the Arrow Go library for things like the ipc.NewWriter etc.

Essentially it ends up looking like:

parquet.NewWriterProperties(parquet.WithMaxRowGroupLength(nrows), parquet.With.........)

where NewWriterProperties can take any number of these WriterProperty options, including none, which would just produce the default WriterProperties.
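
For anyone unfamiliar with the pattern, a minimal self-contained sketch of the functional-options approach being described; the config struct, field names, and default value are illustrative, not the library's actual internals:

```go
package parquet

// writerConfig is an illustrative stand-in for the real WriterProperties.
type writerConfig struct {
	maxRowGroupLength int64
}

// WriterProperty is a functional option applied to the config under construction.
type WriterProperty func(*writerConfig)

// WithMaxRowGroupLength returns an option setting the maximum rows per row group.
func WithMaxRowGroupLength(nrows int64) WriterProperty {
	return func(cfg *writerConfig) { cfg.maxRowGroupLength = nrows }
}

// NewWriterProperties applies any number of options over the defaults;
// calling it with no options yields the default properties.
func NewWriterProperties(opts ...WriterProperty) *writerConfig {
	cfg := &writerConfig{maxRowGroupLength: 64 * 1024 * 1024} // default value assumed
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}
```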

@ggershinsky (Contributor):

> Took a pass through and it looks reasonable. I'm not too familiar with the encryption stuff. Tagging @ggershinsky to see if he wants to take a look.

will be glad to

@ggershinsky (Contributor):

Had a quick look at the encryption part; no specific comments, it seems to be well structured and coded. As long as it interoperates with its C++ counterpart (it can read/decrypt the files written/encrypted by the other), it should be ok, but this is applicable to all other parquet features as well :)

This pull request covers the so-called "low-level" encryption layer. On top of it, Parquet also has a "high-level" encryption layer; please see the explanation of the differences and reasons. The high-level layer is already released in Java Parquet (v1.12.0), which went into Spark master for the next Spark release. This layer is also implemented in C++ and merged in Arrow (#8023). You might consider adding this layer to the Go implementation as well, so it will be able to interop with Spark and PyArrow, and will benefit from the additional security checks/features.

@zeroshade (Member Author):

@ggershinsky So the completed library does use the files in the parquet-test-data github submodule to confirm that I'm able to read/decrypt the files there, in addition to the files it writes itself, which ensures the interoperability.

I'll take a look at the "high-level" encryption layer and see how difficult it would be to implement / add. Depending on that I may add it as a separate PR rather than adding it to this one, if that's ok?

@ggershinsky (Contributor):

Sounds good. My intention was to suggest the high-level layer as a future separate pull request, I should have been more explicit about this.

@zeroshade (Member Author) commented Apr 13, 2021:

After looking at it a bit, I agree that it's definitely a good idea to add the high-level layer as a future enhancement after I finish getting the rest of the parquet impl merged :) Thanks for the suggestion @ggershinsky

@emkornfield @sbinet @nickpoorman any other comments on this one, or am I good to go? :)

I've got about 4 more of these, so I'm getting antsy! haha 😝

emkornfield pushed a commit that referenced this pull request May 20, 2021
Following up from #9817  this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.

Closes #10071 from zeroshade/arrow-12424

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

Closes apache#9817 from zeroshade/arrow-12104

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

Closes apache#9817 from zeroshade/arrow-12104

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Following up from apache#9817  this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.

Closes apache#10071 from zeroshade/arrow-12424

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
@zeroshade deleted the arrow-12104 branch September 12, 2021 19:34