ARROW-17275: [Go][Integration] Handle Large offset types in IPC read/write #13770
Conversation
Thanks for doing this! I'm not a Go developer, but I left a bunch of comments which I hope are not too far off...
@@ -77,7 +77,7 @@ Data Types
 +-------------------+-------+-------+-------+------------+-------+-------+-------+
 | List        | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
 +-------------------+-------+-------+-------+------------+-------+-------+-------+
-| Large List  | ✓ | ✓ |   |   |   | ✓ | ✓ |
+| Large List  | ✓ | ✓ | ✓ |   |   | ✓ | ✓ |
Is the PR title incorrect? This is updating the compatibility matrix for Large List support, not Large String and Large Binary, which are already ticked for Go.
You're absolutely right that I need to rename the PR (I ended up tying the IPC fixes to my other changes adding the LargeList type, so I just incorporated it). Also, I honestly don't know why Large String and Large Binary were ticked; they weren't supported until this change.
	}
	return bldr.NewArray()
case arrow.LARGE_LIST:
	valuesSize := size * 8
Hmm... so, the way I understand this code, `size` is the logical length of the list array and `valuesSize` the logical length of the child list array, right? Meaning that the multiplier here is arbitrary and you could have e.g. `valuesSize := size * 13` or something?
Yeah, I'm doing `* 8` here since it's expected to be 64-bit integers for the test.
Well, I understand that `cts.largeOffsets` generates `size` 64-bit offsets between 0 and `valuesSize` (inclusive). But why should the number of values in the child array be equal to 8 times the number of list offsets?
`size` is the number of elements in the list array (which is why `cts.largeOffsets` outputs a slice with `size+1` elements, because the number of offsets should be one more than the number of elements). I multiplied by 8 here just to have a larger child array than I was using with the initial List cases, so the lengths of the list elements will on average be larger than in the previous case.

TL;DR: `size` is the number of list elements (and `size+1` is the number of offsets). `valuesSize` is the total length of the child array, which will get sliced up into those list elements randomly.
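For readers following along: a list array of `size` elements carries `size+1` offsets into a child array of length `valuesSize`. A minimal sketch of what such an offsets generator could look like, assuming the behavior described above (`largeOffsets` here is a hypothetical reconstruction for illustration, not the actual test helper):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// largeOffsets is a hypothetical stand-in for the test helper discussed
// above: it returns size+1 sorted 64-bit offsets in [0, valuesSize],
// anchored at 0 and valuesSize, so a child array of length valuesSize
// gets carved into size randomly-sized list elements.
func largeOffsets(size, valuesSize int64) []int64 {
	offsets := make([]int64, size+1)
	for i := int64(1); i < size; i++ {
		offsets[i] = rand.Int63n(valuesSize + 1)
	}
	offsets[size] = valuesSize // offsets[0] stays 0
	// List offsets must be non-decreasing to form a valid list array.
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })
	return offsets
}

func main() {
	size := int64(4)
	valuesSize := size * 8 // larger child array => longer list elements on average
	offsets := largeOffsets(size, valuesSize)
	fmt.Println(offsets) // e.g. [0 5 11 24 32]; element i spans values[offsets[i]:offsets[i+1]]
}
```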
-	offsets *Int32Builder
+	offsets Builder

 	dt      arrow.DataType
Is it the actual list type or the offset type? (perhaps add a comment?)
`dt` is the actual list type; there's a separate `etype` member which is the element type of the list.

The offset type (int32 vs int64) doesn't matter to the base `listBuilder` object and is hidden behind the `Builder` interface, cast when necessary to either an `Int32Builder` or `Int64Builder` as appropriate. I'll add a comment.
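A rough sketch of the struct shape being described (field names follow the discussion; the actual struct in the PR may differ, and the stand-in types exist only so the snippet compiles on its own):

```go
package sketch

// DataType and Builder are minimal stand-ins for the corresponding
// arrow / arrow/array types, just so this sketch is self-contained.
type DataType interface{ Name() string }
type Builder interface{ Len() int }

// listBuilder is a hypothetical sketch of the shared base builder:
// the offset width never appears in the struct itself.
type listBuilder struct {
	dt      DataType // the actual list type (List or LargeList)
	etype   DataType // the element type of the list
	values  Builder  // builds the child (element) array
	offsets Builder  // an Int32Builder or Int64Builder behind the interface
}
```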
-func (b *ListBuilder) appendNextOffset() {
-	b.offsets.Append(int32(b.values.Len()))
+func (b *listBuilder) appendNextOffset() {
+	b.appendOffsetVal(b.values.Len())
So this is going through a function pointer and/or vtable, I assume? I don't know how optimized you expect builders to be, but is some inlining done automatically by the compiler here when the derived listBuilder type is known?
Yeah, right now it's going through a function pointer. I don't know offhand whether the Go compiler would be able to inline this through the function pointer (though I can take a look at the generated assembly and see); my initial assumption is that it won't inline it and will go through the function pointer each time.

It was a decision I made to allow the code reuse at the expense of some performance in the builder.

There are a lot of cases where I would prefer to use Go's generics (like this one), but we're trying to maintain compatibility for Go versions latest-1 (i.e. go1.17+), so I don't want to introduce usages of generics until Go 1.19 is released.
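As a rough illustration of the trade-off being discussed (all names here are made up for the sketch, not the PR's actual internals): the offset append goes through a dynamically dispatched interface call, which the Go compiler generally won't devirtualize or inline, whereas a Go 1.18+ generic version could be monomorphized per offset type.

```go
package main

import "fmt"

// offsetAppender mirrors the role of appendOffsetVal in the discussion
// above (hypothetical names, not the PR's actual code).
type offsetAppender interface {
	appendOffsetVal(n int)
}

type int32Offsets struct{ data []int32 }

func (o *int32Offsets) appendOffsetVal(n int) { o.data = append(o.data, int32(n)) }

type int64Offsets struct{ data []int64 }

func (o *int64Offsets) appendOffsetVal(n int) { o.data = append(o.data, int64(n)) }

// listBuilder holds the appender behind an interface, so each
// appendNextOffset call is a dynamically dispatched method call.
type listBuilder struct {
	offsets   offsetAppender
	valuesLen int
}

func (b *listBuilder) appendNextOffset() { b.offsets.appendOffsetVal(b.valuesLen) }

func main() {
	small := &listBuilder{offsets: &int32Offsets{}, valuesLen: 3}
	large := &listBuilder{offsets: &int64Offsets{}, valuesLen: 3}
	small.appendNextOffset()
	large.appendNextOffset()
	fmt.Println(small.offsets, large.offsets) // &{[3]} &{[3]}
}
```

The interface keeps a single builder implementation serving both offset widths; the cost is one indirect call per appended offset.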
@@ -16,6 +16,10 @@
 package arrow

+type OffsetTraits interface {
+	BytesRequired(int) int
Should this get a docstring?
Also, does this need to take and return `int64` instead?
Currently all of the `*Traits` objects (defined in the type_traits_*.go files) take in an `int` and return an `int`. On 64-bit machines this is a 64-bit integer and on 32-bit machines a 32-bit integer.

I didn't want to change those objects (and risk breaking anyone who was using things like `arrow.TimestampTraits.BytesRequired(n)`). In this situation I just created an interface which matched those objects so I could have the data type objects return them. I'll add a docstring comment, but I don't want to modify it to be explicitly `int64`, as that would be a significant, potentially breaking, change.
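A minimal sketch of the pattern being described, assuming traits values shaped like the existing ones (the concrete types below are reconstructed for illustration, not copied from the PR):

```go
package sketch

// OffsetTraits matches the method set that the existing *Traits value
// objects already expose, so they can be returned through this
// interface without touching their int-based signatures.
type OffsetTraits interface {
	// BytesRequired returns the number of bytes needed to store n offsets.
	BytesRequired(n int) int
}

type int32OffsetTraits struct{}

func (int32OffsetTraits) BytesRequired(n int) int { return n * 4 } // 4 bytes per int32 offset

type int64OffsetTraits struct{}

func (int64OffsetTraits) BytesRequired(n int) int { return n * 8 } // 8 bytes per int64 offset

// offsetTraitsFor shows how list-like data types could hand back the
// appropriate traits: List -> 32-bit offsets, LargeList -> 64-bit offsets.
func offsetTraitsFor(large bool) OffsetTraits {
	if large {
		return int64OffsetTraits{}
	}
	return int32OffsetTraits{}
}
```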
Thanks for the thorough review @pitrou! I've pushed a bunch of refactoring and simplifying changes based on your suggestions by leveraging some new interfaces. A side benefit is that the new interfaces will also make it easier for consumers to use the new array types interchangeably.

Also, the failed CI seems to be an issue with Python linting... does anyone know if there's already a Jira issue to address that?
cc @raulcd

Linting issue is being fixed by #13778
LGTM. I don't know if a Go developer wants to review? @raceordie690 @wolfeidau
Thanks @pitrou! I'd love it if another Go developer gave a review. I'll let this sit for another day to give people a chance, and if there are no comments I'll merge it.
I had a read over the code; I'm not an expert on the internals of Arrow, but nothing struck me as out of place.

👍🏻 nice work
Co-authored-by: Antoine Pitrou <[email protected]>
Force-pushed from b662e31 to b99f9e1.
The Python failure here doesn't appear related to this change at all:

=================================== FAILURES ===================================
___________________ [doctest] pyarrow.parquet.read_metadata ____________________
3406
3407 Examples
3408 --------
3409 >>> import pyarrow as pa
3410 >>> import pyarrow.parquet as pq
3411 >>> table = pa.table({'n_legs': [4, 5, 100],
3412 ... 'animal': ["Dog", "Brittle stars", "Centipede"]})
3413 >>> pq.write_table(table, 'example.parquet')
3414
3415 >>> pq.read_metadata('example.parquet')
Differences (unified diff with -expected +actual):
@@ -1,7 +1,7 @@
-<pyarrow._parquet.FileMetaData object at ...>
- created_by: parquet-cpp-arrow version ...
+<pyarrow._parquet.FileMetaData object at 0x7f3c48eb4750>
+ created_by: parquet-cpp-arrow version 10.0.0-SNAPSHOT
num_columns: 2
num_rows: 3
num_row_groups: 1
format_version: 2.6
- serialized_size: 561
+ serialized_size: 562
Thanks for the look over, @wolfeidau!
Benchmark runs are scheduled for baseline = 3b987d9 and contender = db6c099. db6c099 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.