Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12424: [Go][Parquet] Adding Schema Package for Go Parquet #10071

Closed
wants to merge 8 commits into from

Conversation

zeroshade
Copy link
Member

Following up from #9817 this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.

@zeroshade
Copy link
Member Author

Tagging @emkornfield @sbinet @nickpoorman for visability

@github-actions
Copy link

@zeroshade
Copy link
Member Author

bumping for some visibility to beg for reviews :)

func (c *Column) ConvertedType() ConvertedType { return c.pnode.convertedType }
func (c *Column) LogicalType() LogicalType { return c.pnode.logicalType }
func (c *Column) ColumnOrder() parquet.ColumnOrder { return c.pnode.ColumnOrder }
func (c *Column) String() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm surprised with Go's love of abreviations the String method is actually a full word :)

@emkornfield
Copy link
Contributor

Started reviewing, left some comments, will try to finish all files over the next few days.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the comments are around commenting literals. This potentially represents a lot of drudge work but I think in the long run would make the code easier to read/maintain for new comers (or even people who aren't fresh). Happy to figure out a reasonable path forward here.

//
// key node will be renamed to "key", value node if not nil will be renamed to "value"
//
// <map-repetition> must be only optional or required. panics if repeated is passed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments.

go/parquet/schema/helpers.go Outdated Show resolved Hide resolved
type LogicalType interface {
// Returns true if a nested type like List or Map
IsNested() bool
// Returns true if this type can be serialized, ie: not Unknown/NoType/Interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't interval be serialized? is this generated code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L340

The current parquet.thrift file has the logical type for INTERVAL commented out so the generated code doesn't contain a LogicalType for Interval and thus it's unable to be serialized. This also comes from the C++ implementation too https://github.com/apache/arrow/blob/master/cpp/src/parquet/schema.cc#L520

go/parquet/schema/logical_types_test.go Outdated Show resolved Hide resolved
go/parquet/schema/node.go Outdated Show resolved Hide resolved
go/parquet/schema/node.go Show resolved Hide resolved
go/parquet/schema/reflection.go Show resolved Hide resolved
go/parquet/schema/reflection_test.go Show resolved Hide resolved
go/parquet/schema/schema_element_test.go Outdated Show resolved Hide resolved
go/parquet/schema/schema_flatten_test.go Outdated Show resolved Hide resolved
@zeroshade
Copy link
Member Author

@emkornfield i've added the comments requested and docs and answered questions. Lemme know if there's anything else we need.

@zeroshade
Copy link
Member Author

bump 😄

@nickpoorman
Copy link
Contributor

I'm not a fan of the panics instead of returning errors. Having the panics means if I'm running this in a production environment I have to wrap everything I do with Arrow in a rescue defer which is going to slow things down.

@zeroshade
Copy link
Member Author

@nickpoorman so the tact I've taken in here has been essentially if something is a programmer error such as trying to create an integer node with string logical type, and is unrecoverable I might panic. But if something fails based on user inputs or in normal operation, then I return an error.

I've done a lot of looking to try to reduce the panics in here, is there any specific ones or patterns in here you think should be an error where I'm doing a panic?

@zeroshade
Copy link
Member Author

@nickpoorman my latest update here significantly reduces the panics and replaces them with returning errors, but also provides convenience functions Must, MustPrimitive, and MustGroup which can make it easier to build schema's without having to check for errors constantly if the consumer is ok with panic'ing.

@zeroshade
Copy link
Member Author

@emkornfield @nickpoorman @sbinet any chance at getting re-reviewed and merged?

@emkornfield
Copy link
Contributor

Yes, sorry, last week was busy with other things, I'll take a look at this and other open PRs tomorrow and Wednesday. f you don't hear anything by wednesday evening please try to ping me again.

@zeroshade
Copy link
Member Author

Will do. thanks much

@emkornfield
Copy link
Contributor

Sorry didn't get a chance to look at this today, will see if I can squeeze in a review tonight otherwise, I'll prioritize tomorrow morning.

@emkornfield
Copy link
Contributor

@zeroshade left some comments but still going through this, will try to finish this evening.

@zeroshade
Copy link
Member Author

The integration test failure i believe has nothing to do with this change as far as i can tell

scale = 0
)
if info != nil { // we have struct tag info to process
fieldID = info.FieldID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note there has been a recent issue/PR that is changing the logic around fieldID (c++ code will not generate them any more)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I modify this to no longer auto-generate the fieldIDs? is there a specific reason why the C++ code won't generate them anymore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See PR #10289 for details, essentially the field IDs in thrift are mean for other system to set for schema evolution/conversion purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. No specific reason outside of reducing complexity/noise. Arrow simply does not have enough information to generate anything meaningful here (the current "field order" assignment is not really informative and could be easily regenerated by a user if they truly needed it).

Instead the C++ implementation will just pass through the value to/from the parquet layer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I'm fine with pulling out the automatic field id generation here. The current implementation i have here would persist user set ids and only auto generates if they are left as -1, but it's easy to just leave the -1's or pass through what is given. I'll comment after I push that change.


// reverse the order of the list in place so that our result
// is in the proper, correct order.
for i := len(c)/2 - 1; i >= 0; i-- {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go doesn't have a built in reverse function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, there's no built-in reverse for slices. There's a few simple algorithmic things like that which exist in the C++ stdlib but Go doesn't have simply because of the philosophy Go has in trying not to hide complexity where it can.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small question, otherwise LGTM thank you for your patience.

@zeroshade
Copy link
Member Author

@emkornfield I've removed the auto-generation of the fieldIDs and updated the PR. So i think we're all set here! 😄

michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Following up from apache#9817  this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.

Closes apache#10071 from zeroshade/arrow-12424

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
@zeroshade zeroshade deleted the arrow-12424 branch September 12, 2021 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants