ARROW-17275: [Go][Integration] Handle Large offset types in IPC read/write #13770
Conversation
Thanks for doing this! I'm not a Go developer, but I left a bunch of comments which I hope are not too far off...
@@ -77,7 +77,7 @@ Data Types
 +-------------------+-------+-------+-------+------------+-------+-------+-------+
 | List        | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
 +-------------------+-------+-------+-------+------------+-------+-------+-------+
-| Large List  | ✓ | ✓ |   |   |   | ✓ | ✓ |
+| Large List  | ✓ | ✓ | ✓ |   |   | ✓ | ✓ |
Is the PR title incorrect? This is updating the compatibility matrix for Large List support, not Large String and Large Binary, which are already ticked for Go.
You're absolutely right that I need to rename the PR (I ended up tying the IPC fixes to my other changes adding the LargeList type, so I just incorporated it). Also, I honestly don't know why Large String and Large Binary were ticked; they weren't supported until this change.
	}
	return bldr.NewArray()
case arrow.LARGE_LIST:
	valuesSize := size * 8
Hmm... so, the way I understand this code, `size` is the logical length of the list array and `valuesSize` the logical length of the child list array, right? Meaning that the multiplier here is arbitrary and you could have e.g. `valuesSize := size * 13` or something?
Yeah, I'm doing `* 8` here since it's expected to be 64-bit integers for the test.
Well, I understand that `cts.largeOffsets` generates `size` 64-bit offsets between 0 and `valuesSize` (inclusive). But why should the number of values in the child array be equal to 8 times the number of list offsets?
`size` is the number of elements in the list array (which is why `cts.largeOffsets` outputs a slice with `size+1` elements, because the number of offsets should be one more than the number of elements). I multiplied by 8 here just to have a larger child array than I was using with the initial List cases, so the lengths of the list elements will on average be larger than in the previous case.

TL;DR: `size` is the number of list elements (and `size+1` is the number of offsets). `valuesSize` is the total length of the child array, which will get sliced up into those list elements randomly.
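For readers following along: a list array of `size` elements carries `size+1` offsets into a child array of length `valuesSize`. A minimal sketch of what such an offsets generator could look like, assuming the behavior described above (`largeOffsets` here is a hypothetical reconstruction for illustration, not the actual test helper):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// largeOffsets is a hypothetical stand-in for the test helper discussed
// above: it returns size+1 sorted 64-bit offsets in [0, valuesSize],
// anchored at 0 and valuesSize, so a child array of length valuesSize
// gets carved into size randomly-sized list elements.
func largeOffsets(size, valuesSize int64) []int64 {
	offsets := make([]int64, size+1)
	for i := int64(1); i < size; i++ {
		offsets[i] = rand.Int63n(valuesSize + 1)
	}
	offsets[size] = valuesSize // offsets[0] stays 0
	// List offsets must be non-decreasing to form a valid list array.
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })
	return offsets
}

func main() {
	size := int64(4)
	valuesSize := size * 8 // larger child array => longer list elements on average
	offsets := largeOffsets(size, valuesSize)
	fmt.Println(offsets) // e.g. [0 5 11 24 32]; element i spans values[offsets[i]:offsets[i+1]]
}
```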
-	offsets *Int32Builder
+	offsets Builder

 	dt      arrow.DataType
Is it the actual list type or the offset type? (perhaps add a comment?)
`dt` is the actual list type; there's a separate `etype` member which is the element type of the list.

The offset type (int32 vs int64) doesn't matter to the base `listBuilder` object and is hidden behind the `Builder` interface, cast when necessary to either an `Int32Builder` or `Int64Builder` as appropriate. I'll add a comment.
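A rough sketch of the struct shape being described (field names follow the discussion; the actual struct in the PR may differ, and the stand-in types exist only so the snippet compiles on its own):

```go
package sketch

// DataType and Builder are minimal stand-ins for the corresponding
// arrow / arrow/array types, just so this sketch is self-contained.
type DataType interface{ Name() string }
type Builder interface{ Len() int }

// listBuilder is a hypothetical sketch of the shared base builder:
// the offset width never appears in the struct itself.
type listBuilder struct {
	dt      DataType // the actual list type (List or LargeList)
	etype   DataType // the element type of the list
	values  Builder  // builds the child (element) array
	offsets Builder  // an Int32Builder or Int64Builder behind the interface
}
```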
-func (b *ListBuilder) appendNextOffset() {
-	b.offsets.Append(int32(b.values.Len()))
+func (b *listBuilder) appendNextOffset() {
+	b.appendOffsetVal(b.values.Len())
So this is going through a function pointer and/or vtable, I assume? I don't know how optimized you expect builders to be, but is some inlining done automatically by the compiler here when the derived listBuilder type is known?
Yeah, right now it's going through a function pointer. I don't know offhand whether the Go compiler would be able to inline this through the function pointer (though I can take a look at the generated assembly and see); my initial assumption is that it won't inline it and will go through the function pointer each time.

It was a decision I made to allow the code reuse at the expense of some performance in the builder.

There are a lot of cases where I would prefer to use Go's generics (like this one), but we're trying to maintain compatibility for Go versions latest-1 (i.e. go1.17+), so I don't want to introduce usages of generics until Go 1.19 is released.
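As a rough illustration of the trade-off being discussed (all names here are made up for the sketch, not the PR's actual internals): the offset append goes through a dynamically dispatched interface call, which the Go compiler generally won't devirtualize or inline, whereas a Go 1.18+ generic version could be monomorphized per offset type.

```go
package main

import "fmt"

// offsetAppender mirrors the role of appendOffsetVal in the discussion
// above (hypothetical names, not the PR's actual code).
type offsetAppender interface {
	appendOffsetVal(n int)
}

type int32Offsets struct{ data []int32 }

func (o *int32Offsets) appendOffsetVal(n int) { o.data = append(o.data, int32(n)) }

type int64Offsets struct{ data []int64 }

func (o *int64Offsets) appendOffsetVal(n int) { o.data = append(o.data, int64(n)) }

// listBuilder holds the appender behind an interface, so each
// appendNextOffset call is a dynamically dispatched method call.
type listBuilder struct {
	offsets   offsetAppender
	valuesLen int
}

func (b *listBuilder) appendNextOffset() { b.offsets.appendOffsetVal(b.valuesLen) }

func main() {
	small := &listBuilder{offsets: &int32Offsets{}, valuesLen: 3}
	large := &listBuilder{offsets: &int64Offsets{}, valuesLen: 3}
	small.appendNextOffset()
	large.appendNextOffset()
	fmt.Println(small.offsets, large.offsets) // &{[3]} &{[3]}
}
```

The interface keeps a single builder implementation serving both offset widths; the cost is one indirect call per appended offset.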
@@ -16,6 +16,10 @@
 package arrow

+type OffsetTraits interface {
+	BytesRequired(int) int
Should this get a docstring?
Also, does this need to take and return `int64` instead?
Currently all of the `*Traits` objects (defined in the type_traits_*.go files) take in an `int` and return an `int`. On 64-bit machines this is a 64-bit integer and on 32-bit machines a 32-bit integer.

I didn't want to change those objects (and risk breaking anyone who was using things like `arrow.TimestampTraits.BytesRequired(n)`). In this situation I just created an interface which matched those objects so I could have the data type objects return them. I'll add a docstring comment, but I don't want to modify it to be explicitly `int64`, as that would be a significant, potentially breaking, change.
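A minimal sketch of the pattern being described, assuming traits values shaped like the existing ones (the concrete types below are reconstructed for illustration, not copied from the PR):

```go
package sketch

// OffsetTraits matches the method set that the existing *Traits value
// objects already expose, so they can be returned through this
// interface without touching their int-based signatures.
type OffsetTraits interface {
	// BytesRequired returns the number of bytes needed to store n offsets.
	BytesRequired(n int) int
}

type int32OffsetTraits struct{}

func (int32OffsetTraits) BytesRequired(n int) int { return n * 4 } // 4 bytes per int32 offset

type int64OffsetTraits struct{}

func (int64OffsetTraits) BytesRequired(n int) int { return n * 8 } // 8 bytes per int64 offset

// offsetTraitsFor shows how list-like data types could hand back the
// appropriate traits: List -> 32-bit offsets, LargeList -> 64-bit offsets.
func offsetTraitsFor(large bool) OffsetTraits {
	if large {
		return int64OffsetTraits{}
	}
	return int32OffsetTraits{}
}
```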
Thanks for the thorough review @pitrou! I've pushed a bunch of refactoring and simplifying changes based on your suggestions by leveraging some new interfaces. A side benefit is that the new interfaces will also make it easier for consumers to use the new array types interchangeably.

Also, the failed CI seems to be an issue with Python linting... does anyone know if there's already a Jira issue to address that?
cc @raulcd

Linting issue is being fixed by #13778
LGTM. I don't know if a Go developer wants to review? @raceordie690 @wolfeidau
Thanks @pitrou! I'd love it if another Go developer gave a review. I'll let this sit for another day to give people a chance, and if there are no comments I'll merge it.
I had a read over the code; I'm not an expert on the internals of Arrow, but nothing struck me as out of place.

👍🏻 nice work
Co-authored-by: Antoine Pitrou <[email protected]>
Force-pushed from b662e31 to b99f9e1.
The Python failure here doesn't appear related to this change at all:

=================================== FAILURES ===================================
___________________ [doctest] pyarrow.parquet.read_metadata ____________________
3406
3407 Examples
3408 --------
3409 >>> import pyarrow as pa
3410 >>> import pyarrow.parquet as pq
3411 >>> table = pa.table({'n_legs': [4, 5, 100],
3412 ... 'animal': ["Dog", "Brittle stars", "Centipede"]})
3413 >>> pq.write_table(table, 'example.parquet')
3414
3415 >>> pq.read_metadata('example.parquet')
Differences (unified diff with -expected +actual):
@@ -1,7 +1,7 @@
-<pyarrow._parquet.FileMetaData object at ...>
- created_by: parquet-cpp-arrow version ...
+<pyarrow._parquet.FileMetaData object at 0x7f3c48eb4750>
+ created_by: parquet-cpp-arrow version 10.0.0-SNAPSHOT
num_columns: 2
num_rows: 3
num_row_groups: 1
format_version: 2.6
- serialized_size: 561
+ serialized_size: 562
Thanks for the look over, @wolfeidau!
Benchmark runs are scheduled for baseline = 3b987d9 and contender = db6c099. db6c099 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.