ARROW-12424: [Go][Parquet] Adding Schema Package for Go Parquet #10071

zeroshade · 2021-04-16T15:47:27Z

Following up from #9817, this is the next chunk of code for the Go Parquet port: the Schema package, which implements the Converted and Logical types and handles schema creation, manipulation, and printing.

zeroshade · 2021-04-16T15:47:45Z

Tagging @emkornfield @sbinet @nickpoorman for visibility

github-actions · 2021-04-16T15:49:50Z

https://issues.apache.org/jira/browse/ARROW-12424

zeroshade · 2021-04-20T13:58:09Z

bumping for some visibility to beg for reviews :)

go/parquet/internal/debug/log_off.go

go/parquet/schema/column.go

emkornfield · 2021-04-20T15:58:44Z

go/parquet/schema/column.go

+func (c *Column) ConvertedType() ConvertedType     { return c.pnode.convertedType }
+func (c *Column) LogicalType() LogicalType         { return c.pnode.logicalType }
+func (c *Column) ColumnOrder() parquet.ColumnOrder { return c.pnode.ColumnOrder }
+func (c *Column) String() string {


I'm surprised that, with Go's love of abbreviations, the String method is actually a full word :)

emkornfield · 2021-04-20T15:59:38Z

Started reviewing, left some comments, will try to finish all files over the next few days.

go/parquet/schema/column.go

go/parquet/schema/converted_types.go

go/parquet/schema/helpers.go

emkornfield

Most of the comments are about commenting literals. This potentially represents a lot of drudge work, but I think in the long run it would make the code easier to read and maintain for newcomers (or even people who aren't fresh). Happy to figure out a reasonable path forward here.

emkornfield · 2021-04-25T21:42:31Z

go/parquet/schema/helpers.go

+//
+// key node will be renamed to "key", value node if not nil will be renamed to "value"
+//
+// <map-repetition> must be only optional or required. panics if repeated is passed.


Thanks for the comments.

go/parquet/schema/helpers.go

emkornfield · 2021-04-25T21:44:36Z

go/parquet/schema/logical_types.go

+type LogicalType interface {
+	// Returns true if a nested type like List or Map
+	IsNested() bool
+	// Returns true if this type can be serialized, ie: not Unknown/NoType/Interval


why can't interval be serialized? is this generated code?

https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L340

The current parquet.thrift file has the logical type for INTERVAL commented out, so the generated code doesn't contain a LogicalType for Interval and it can't be serialized. This also matches the C++ implementation: https://github.com/apache/arrow/blob/master/cpp/src/parquet/schema.cc#L520

go/parquet/schema/logical_types_test.go

go/parquet/schema/node.go

go/parquet/schema/reflection.go

go/parquet/schema/reflection_test.go

go/parquet/schema/schema_element_test.go

go/parquet/schema/schema_flatten_test.go

zeroshade · 2021-04-26T15:05:52Z

@emkornfield i've added the comments requested and docs and answered questions. Lemme know if there's anything else we need.

zeroshade · 2021-04-29T15:17:07Z

bump 😄

nickpoorman · 2021-04-29T21:27:07Z

I'm not a fan of the panics instead of returning errors. Having the panics means that if I'm running this in a production environment, I have to wrap everything I do with Arrow in a deferred recover, which is going to slow things down.
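To illustrate the concern above, here is a minimal sketch of the kind of wrapper a caller would need around a panicking API. `buildNode` and `safeBuildNode` are hypothetical names made up for this example and are not part of the schema package.

```go
package main

import "fmt"

// buildNode is a hypothetical stand-in for a library call that panics
// on invalid input instead of returning an error.
func buildNode(name string) string {
	if name == "" {
		panic("node name must not be empty")
	}
	return "node:" + name
}

// safeBuildNode converts a panic from buildNode into an ordinary error
// by calling recover inside a deferred function.
func safeBuildNode(name string) (node string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return buildNode(name), nil
}

func main() {
	if _, err := safeBuildNode(""); err != nil {
		fmt.Println("error:", err)
	}
	node, _ := safeBuildNode("id")
	fmt.Println(node)
}
```

Every call site that wants to stay panic-free would need a wrapper like this, which is the boilerplate being objected to.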

zeroshade · 2021-04-29T23:12:25Z

@nickpoorman so the tack I've taken in here has essentially been: if something is a programmer error, such as trying to create an integer node with a string logical type, and is unrecoverable, I might panic. But if something fails based on user input or in normal operation, then I return an error.

I've done a lot of looking to try to reduce the panics in here. Are there any specific ones or patterns in here that you think should be an error where I'm doing a panic?

zeroshade · 2021-05-02T23:48:22Z

@nickpoorman my latest update here significantly reduces the panics and replaces them with returned errors, but also provides the convenience functions Must, MustPrimitive, and MustGroup, which make it easier to build schemas without having to check for errors constantly if the consumer is OK with panicking.
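A rough sketch of the Must-style pattern described above, for readers unfamiliar with it. The `Node` type and `NewNode` constructor here are simplified, hypothetical stand-ins, not the package's real types or signatures; only the general shape of a `Must` wrapper is being shown.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical, simplified stand-in for a schema node.
type Node struct {
	Name string
}

// NewNode is a hypothetical constructor that reports bad input as an
// error rather than panicking.
func NewNode(name string) (*Node, error) {
	if name == "" {
		return nil, errors.New("node name must not be empty")
	}
	return &Node{Name: name}, nil
}

// Must panics if err is non-nil and otherwise returns n unchanged, so
// callers who accept panics can skip explicit error checks.
func Must(n *Node, err error) *Node {
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	// Error-returning style: the caller decides how to handle failure.
	if _, err := NewNode(""); err != nil {
		fmt.Println("error:", err)
	}

	// Must style: concise, but panics on failure.
	n := Must(NewNode("id"))
	fmt.Println(n.Name)
}
```

The design leaves the choice with the caller: production code can check errors, while tests and static schema definitions can use the shorter form.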

zeroshade · 2021-05-10T23:35:48Z

@emkornfield @nickpoorman @sbinet any chance at getting re-reviewed and merged?

emkornfield · 2021-05-10T23:45:08Z

Yes, sorry, last week was busy with other things. I'll take a look at this and other open PRs tomorrow and Wednesday. If you don't hear anything by Wednesday evening, please try to ping me again.

zeroshade · 2021-05-11T00:40:28Z

Will do. thanks much

emkornfield · 2021-05-12T16:02:09Z

Sorry didn't get a chance to look at this today, will see if I can squeeze in a review tonight otherwise, I'll prioritize tomorrow morning.

go/parquet/schema/logical_types.go

emkornfield · 2021-05-13T16:18:51Z

@zeroshade left some comments but still going through this, will try to finish this evening.

zeroshade · 2021-05-13T21:01:02Z

As far as I can tell, the integration test failure has nothing to do with this change.

emkornfield · 2021-05-17T15:55:27Z

go/parquet/schema/reflection.go

+		scale                 = 0
+	)
+	if info != nil { // we have struct tag info to process
+		fieldID = info.FieldID


just a note there has been a recent issue/PR that is changing the logic around fieldID (c++ code will not generate them any more)

Should I modify this to no longer auto-generate the fieldIDs? is there a specific reason why the C++ code won't generate them anymore?

See PR #10289 for details. Essentially, the field IDs in thrift are meant for other systems to set for schema evolution/conversion purposes.

Yes. No specific reason outside of reducing complexity/noise. Arrow simply does not have enough information to generate anything meaningful here (the current "field order" assignment is not really informative and could be easily regenerated by a user if they truly needed it).

Instead the C++ implementation will just pass through the value to/from the parquet layer.

Fair enough. I'm fine with pulling out the automatic field id generation here. The current implementation i have here would persist user set ids and only auto generates if they are left as -1, but it's easy to just leave the -1's or pass through what is given. I'll comment after I push that change.
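The difference between the two behaviors discussed above can be sketched as follows. The `field` type and both function names are hypothetical, and using the field's position as the generated ID is an assumption made only for illustration; the thread does not show how the IDs were actually generated.

```go
package main

import "fmt"

// field is a hypothetical, simplified schema field; it is not the
// package's real node type.
type field struct {
	Name    string
	FieldID int32 // -1 means "not set by the user"
}

// autoAssign sketches the earlier behavior: user-set IDs are kept and
// unset (-1) IDs are filled in. Using the position as the ID is an
// assumption for this example.
func autoAssign(fields []field) []field {
	out := make([]field, len(fields))
	copy(out, fields)
	for i := range out {
		if out[i].FieldID == -1 {
			out[i].FieldID = int32(i)
		}
	}
	return out
}

// passThrough sketches the updated behavior: IDs are returned exactly
// as given, so unset fields stay at -1.
func passThrough(fields []field) []field {
	out := make([]field, len(fields))
	copy(out, fields)
	return out
}

func main() {
	in := []field{{Name: "a", FieldID: 7}, {Name: "b", FieldID: -1}}
	fmt.Println(autoAssign(in))
	fmt.Println(passThrough(in))
}
```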

emkornfield · 2021-05-17T15:58:54Z

go/parquet/schema/node.go

+
+	// reverse the order of the list in place so that our result
+	// is in the proper, correct order.
+	for i := len(c)/2 - 1; i >= 0; i-- {


go doesn't have a built in reverse function?

Nope, there's no built-in reverse for slices. There are a few simple algorithmic things like that which exist in the C++ stdlib but which Go doesn't have, simply because of Go's philosophy of trying not to hide complexity where it can.
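For reference, the in-place reversal loop from the diff can be written as a standalone helper like this. The function name, the `[]string` element type, and the sample data are assumptions for the example; only the loop header appears in the diff above.

```go
package main

import "fmt"

// reverseInPlace reverses c without allocating a second slice by
// swapping each element in the first half with its mirror element.
func reverseInPlace(c []string) {
	for i := len(c)/2 - 1; i >= 0; i-- {
		opp := len(c) - 1 - i // index of the mirror element
		c[i], c[opp] = c[opp], c[i]
	}
}

func main() {
	path := []string{"leaf", "child", "root"}
	reverseInPlace(path)
	fmt.Println(path) // prints [root child leaf]
}
```

Go 1.21, released after this discussion, added `slices.Reverse` to the standard library, so newer code may not need a hand-written loop.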

emkornfield

one small question, otherwise LGTM thank you for your patience.

zeroshade · 2021-05-18T19:57:16Z

@emkornfield I've removed the auto-generation of the fieldIDs and updated the PR. So i think we're all set here! 😄

Following up from apache#9817 this is the next chunk of code for the Go Parquet port consisting of the Schema package, implementing the Converted and Logical types along with handling schema creation, manipulation, and printing.

Closes apache#10071 from zeroshade/arrow-12424

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>

github-actions bot added the Component: Go label Apr 16, 2021