ARROW-12424: [Go][Parquet] Adding Schema Package for Go Parquet #10071
Tagging @emkornfield @sbinet @nickpoorman for visibility
bumping for some visibility to beg for reviews :)
func (c *Column) ConvertedType() ConvertedType { return c.pnode.convertedType }
func (c *Column) LogicalType() LogicalType { return c.pnode.logicalType }
func (c *Column) ColumnOrder() parquet.ColumnOrder { return c.pnode.ColumnOrder }
func (c *Column) String() string {
I'm surprised that, with Go's love of abbreviations, the String method is actually a full word :)
Started reviewing, left some comments, will try to finish all files over the next few days.
Most of the comments are around commenting literals. This potentially represents a lot of drudge work, but I think in the long run it would make the code easier to read and maintain for newcomers (or even people who aren't fresh). Happy to figure out a reasonable path forward here.
//
// key node will be renamed to "key", value node if not nil will be renamed to "value"
//
// <map-repetition> must be only optional or required. panics if repeated is passed.
Thanks for the comments.
type LogicalType interface {
// Returns true if a nested type like List or Map
IsNested() bool
// Returns true if this type can be serialized, ie: not Unknown/NoType/Interval
why can't interval be serialized? is this generated code?
https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L340
The current parquet.thrift file has the logical type for INTERVAL commented out, so the generated code doesn't contain a LogicalType for Interval and thus it can't be serialized. This mirrors the C++ implementation: https://github.com/apache/arrow/blob/master/cpp/src/parquet/schema.cc#L520
@emkornfield I've added the requested comments and docs and answered the questions. Lemme know if there's anything else we need.
bump 😄
I'm not a fan of the panics instead of returning errors. Having the panics means that if I'm running this in a production environment, I have to wrap everything I do with Arrow in a deferred recover, which is going to slow things down.
@nickpoorman so the tack I've taken here has essentially been: if something is a programmer error, such as trying to create an integer node with a string logical type, and is unrecoverable, I might panic. But if something fails based on user inputs or in normal operation, then I return an error. I've done a lot of looking to try to reduce the panics in here. Are there any specific ones or patterns in here that you think should be an error where I'm doing a panic?
@nickpoorman my latest update here significantly reduces the panics and replaces them with returning errors, but also provides convenience functions
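To make the trade-off discussed above concrete, here is a minimal sketch of the general pattern: a constructor that returns an error for invalid input, plus a panicking convenience wrapper for callers who know their input is valid. The names `Repetition`, `MapNode`, `NewMapNode`, and `MustMapNode` are hypothetical and simplified for illustration; they are not taken from the PR, and the actual convenience functions may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// Repetition is a hypothetical stand-in for a Parquet repetition type.
type Repetition int

const (
	Required Repetition = iota
	Optional
	Repeated
)

// MapNode is a hypothetical, heavily simplified schema node.
type MapNode struct {
	Name string
	Rep  Repetition
}

// NewMapNode reports invalid input as an error, so callers do not
// need a deferred recover to survive bad input.
func NewMapNode(name string, rep Repetition) (*MapNode, error) {
	if rep == Repeated {
		return nil, errors.New("map repetition must be optional or required")
	}
	return &MapNode{Name: name, Rep: rep}, nil
}

// MustMapNode is a convenience wrapper that panics on error. It suits
// cases where the arguments are known to be valid when the code is written.
func MustMapNode(name string, rep Repetition) *MapNode {
	n, err := NewMapNode(name, rep)
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	// The error-returning form lets the caller decide how to react.
	if _, err := NewMapNode("m", Repeated); err != nil {
		fmt.Println("error:", err)
	}
	// The panicking form keeps call sites short for known-good input.
	n := MustMapNode("m", Optional)
	fmt.Println(n.Name)
}
```

With this split, production code can call the error-returning form and handle failures explicitly, while tests and static schema definitions can use the shorter panicking form.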
@emkornfield @nickpoorman @sbinet any chance of getting this re-reviewed and merged?
Yes, sorry, last week was busy with other things. I'll take a look at this and other open PRs tomorrow and Wednesday. If you don't hear anything by Wednesday evening, please try to ping me again.
Will do. Thanks much.
Sorry, didn't get a chance to look at this today. I'll see if I can squeeze in a review tonight; otherwise, I'll prioritize it tomorrow morning.
@zeroshade left some comments but still going through this, will try to finish this evening.
As far as I can tell, the integration test failure has nothing to do with this change.
scale = 0
)
if info != nil { // we have struct tag info to process
fieldID = info.FieldID
Just a note: a recent issue/PR changes the logic around fieldID (the C++ code will no longer generate them).
Should I modify this to no longer auto-generate the fieldIDs? Is there a specific reason why the C++ code won't generate them anymore?
See PR #10289 for details; essentially, the field IDs in Thrift are meant for other systems to set for schema evolution/conversion purposes.
Yes. No specific reason outside of reducing complexity/noise. Arrow simply does not have enough information to generate anything meaningful here (the current "field order" assignment is not really informative and could be easily regenerated by a user if they truly needed it).
Instead the C++ implementation will just pass through the value to/from the parquet layer.
Fair enough. I'm fine with pulling out the automatic field ID generation here. The current implementation I have here persists user-set IDs and only auto-generates them if they are left as -1, but it's easy to just leave the -1s or pass through what is given. I'll comment after I push that change.
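The pass-through behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fieldInfo`, `unsetFieldID`, and `resolveFieldID` are hypothetical names, and the only details taken from the discussion are that -1 marks an unset ID and that user-supplied values are kept as given.

```go
package main

import "fmt"

// unsetFieldID is the sentinel for "no field ID provided" (-1 in the discussion).
const unsetFieldID int32 = -1

// fieldInfo is a hypothetical stand-in for parsed struct tag info.
type fieldInfo struct {
	FieldID int32
}

// resolveFieldID passes through whatever ID the user supplied and
// never invents one: missing info or an unset ID both stay at -1.
func resolveFieldID(info *fieldInfo) int32 {
	if info == nil {
		return unsetFieldID
	}
	return info.FieldID
}

func main() {
	fmt.Println(resolveFieldID(nil))                     // no tag info
	fmt.Println(resolveFieldID(&fieldInfo{FieldID: 7}))  // user-set ID is kept
	fmt.Println(resolveFieldID(&fieldInfo{FieldID: -1})) // unset stays unset
}
```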
// reverse the order of the list in place so that our result
// is in the proper, correct order.
for i := len(c)/2 - 1; i >= 0; i-- {
Go doesn't have a built-in reverse function?
Nope, there's no built-in reverse for slices. There are a few simple algorithmic things like that which exist in the C++ stdlib but not in Go, simply because of Go's philosophy of trying not to hide complexity where it can.
One small question; otherwise LGTM. Thank you for your patience.
@emkornfield I've removed the auto-generation of the fieldIDs and updated the PR. So I think we're all set here! 😄
Following up from apache#9817 this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.

Closes apache#10071 from zeroshade/arrow-12424

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
Following up from #9817 this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.