ARROW-12104: [Go][Parquet] Second chunk of Ported Go Parquet code #9817
Conversation
LGTM
sofar += n
}
if err != nil && err != io.EOF {
	panic(err)
Why panic and not return an error?
It simplifies the Encode/Decode interface, and the failure isn't recoverable anyway.
go/parquet/compress/compress.go
func GetCodec(typ Compression) Codec {
	ret, ok := codecs[typ]
	if !ok {
		// return codecs[Codecs.Uncompressed]
does this require further thought on defaults? or should the comment go?
For now I liked the idea of erroring when trying to retrieve a codec we haven't implemented, rather than silently returning the uncompressed one. The alternative to panicking would be to change this to return (Codec, error) and have it return nil and an unimplemented error if it can't find the desired codec.
I think we should use panics only when absolutely necessary. If a codec can't be found, the idiomatic way to signal that in Go is to return an error. This is how I implemented it:
So, between the two ideas, defaulting down to returning Uncompressed vs. modifying this to return (Codec, error) and returning an error, what do you think?
I've updated this to return (Codec, error) now instead of panicking.
👍 I think returning the error is the right move
wr := codec.NewWriter(&buf)

const chunkSize = 1111
why 1111?
Pulled this test from the C++ implementation's tests. Ultimately it's a number small enough to ensure we'll have multiple chunks, but large enough that there's some compression it can do :)
I'm surprised C++ doesn't use a power of 2 (1024?)
I think the intent was to ensure that there is a chunk at the end which is not a full chunk to make sure we test that handling properly.
func (nocodec) CompressBound(len int64) int64 { return len }

func init() {
	codecs[Codecs.Uncompressed] = nocodec{}
I like the name, but any thoughts on Codecs.Identity - for example, https://grpc.github.io/grpc-java/javadoc/io/grpc/Codec.Identity.html
I think I prefer leaving the name as Uncompressed.
Sorry, this week is particularly bad. I will try to review on the weekend / next week.
I removed the "@" mentions in the description. It appears I get notified every time someone clones the commit in master. Please tag people in a first comment instead.
@emkornfield Did not realize that, will keep that in mind for future PRs. Sorry!
rebased from master
some docs / comments, update go.sum
@emkornfield @sbinet @nickpoorman Bump on getting reviews here! Thanks!
will try to do a first pass today or tomorrow.
Took a pass through and it looks reasonable. I'm not too familiar with the encryption stuff. Tagging @ggershinsky to see if he wants to take a look.
// NewWriter provides a wrapper around a write stream to compress data before writing it.
NewWriter(io.Writer) io.WriteCloser
// NewWriterLevel is like NewWriter but allows specifying the compression level
NewWriterLevel(io.Writer, int) (io.WriteCloser, error)
API comment: is it worth exposing the compression level in this interface? An alternative would have been a member set at construction time.
So the way I built this was to minimize object creation: the map contains stateless objects to ensure we have no race conditions, and the objects themselves know how to construct the appropriate encoders and decoders. I exposed it like this because not all of the underlying libraries allow configuring the compression level after constructing the encoder; most of them take the compression level as an argument to their NewWriter functions etc., which is what I modeled this after.
Required Repetition
Optional Repetition
Repeated Repetition
Undefined Repetition // convenience value
it might pay to distinguish between undefined and not set. An issue was raised recently that, at least in C++, we write Repetition required for the root of the schema, when according to the spec we probably shouldn't.
In this case Undefined == not set; it explicitly only exists to have an available option for "Not Set", since "undefined" is not a legitimate value in the Parquet spec. If you'd prefer, I can change the name from Undefined to NotSet.
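A small sketch of the enum shape under discussion: Undefined exists only as a "not set" sentinel outside the legitimate Parquet repetition values, so code can check validity before writing. Names follow the excerpt; the Valid helper is illustrative:

```go
package main

import "fmt"

type Repetition int

const (
	Required Repetition = iota
	Optional
	Repeated
	Undefined // convenience value: "not set", not a legal Parquet repetition
)

// Valid reports whether r is one of the repetitions the Parquet spec defines.
func (r Repetition) Valid() bool { return r >= Required && r < Undefined }

func main() {
	fmt.Println(Required.Valid(), Undefined.Valid()) // prints: true false
}
```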
}

// WithMaxRowGroupLength specifies the maximum number of rows for a given row group in the writer.
func WithMaxRowGroupLength(nrows int64) WriterProperty {
not sure if this was mirrored on the C++ implementation intentionally; just wanted to check that this pattern is idiomatic Go?
The C++ implementation uses a Builder pattern: it constructs an object with a bunch of "Set...." functions on it and then eventually calls "Build" to generate the resulting object. This follows the idiomatic Go pattern for taking options, which is the same pattern used in the Arrow Go library for things like ipc.NewWriter etc.
Essentially it ends up looking like:
parquet.NewWriterProperties(parquet.WithMaxRowGroupLength(nrows), parquet.With.........)
where NewWriterProperties can take any number of these WriterProperty options, including none, which would just produce the default WriterProperties.
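The functional-options pattern described above can be sketched minimally like this. The names mirror the discussion, but the struct fields and default are illustrative, not the real WriterProperties:

```go
package main

import "fmt"

// WriterProperties is a hypothetical options struct; only one field is
// shown to keep the sketch small.
type WriterProperties struct {
	maxRowGroupLength int64
}

// WriterProperty is a functional option that mutates the properties
// being built.
type WriterProperty func(*WriterProperties)

func WithMaxRowGroupLength(nrows int64) WriterProperty {
	return func(p *WriterProperties) { p.maxRowGroupLength = nrows }
}

// NewWriterProperties applies any number of options over the defaults;
// with no options it just returns the default properties.
func NewWriterProperties(opts ...WriterProperty) *WriterProperties {
	p := &WriterProperties{maxRowGroupLength: 64 * 1024 * 1024} // assumed default
	for _, o := range opts {
		o(p)
	}
	return p
}

func main() {
	fmt.Println(NewWriterProperties().maxRowGroupLength)                           // prints: 67108864
	fmt.Println(NewWriterProperties(WithMaxRowGroupLength(1000)).maxRowGroupLength) // prints: 1000
}
```

The appeal over a Builder is that the zero-options call site stays a single expression and new options can be added without breaking existing callers.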
will be glad to
had a quick look at the encryption part; no specific comments, it seems to be well structured and coded. As long as it interoperates with its C++ counterpart (can read/decrypt the files written/encrypted by the other), it should be ok, but this is applicable to all other parquet features as well :) This pull request covers the so-called "low-level" encryption layer. On top of it, parquet also has a "high-level" encryption layer; please see the explanation of the differences and reasons. The high-level layer is already released in Java parquet (v1.12.0), which went into Spark master for the next Spark release. This layer is also implemented in C++ and merged in Arrow (#8023). You might consider adding this layer to the Go implementation as well, so it will be able to interop with Spark and PyArrow, and will benefit from the additional security checks/features.
@ggershinsky So the completed library does use the files in the parquet-test-data github submodule to confirm that I'm able to read/decrypt the files there, in addition to the files written by itself, so that ensures interoperability. I'll take a look at the "high-level" encryption layer and see how difficult it would be to implement/add. Depending on that, I may add it as a separate PR rather than adding it to this one, if that's ok?
Sounds good. My intention was to suggest the high-level layer as a future separate pull request; I should have been more explicit about this.
After looking at it a bit, I agree that it's definitely a good idea to add the high-level layer as a future enhancement after I finish getting the rest of the parquet impl merged :) Thanks for the suggestion @ggershinsky. @emkornfield @sbinet @nickpoorman any other comments on this one, or am I good to go? :) I've got about 4 more of these, so I'm getting antsy! haha 😝
Following up from #9817 this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing. Closes #10071 from zeroshade/arrow-12424 Authored-by: Matthew Topol <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling. Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read. Closes apache#9817 from zeroshade/arrow-12104 Authored-by: Matthew Topol <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>