[C++] Kernel to select subset of fields of a StructArray #31101
Comments
&res / @0x26res: I've noticed that you can't cast a struct array to a subset of the struct's fields. For example:

import pyarrow as pa

struct_type = pa.struct(
    [pa.field("field1", pa.string()), pa.field("field2", pa.int32())]
)
sub_struct_type = pa.struct(
    [
        pa.field("field1", pa.string()),
    ]
)
struct_array = pa.array(
    [
        ("ABC", 123),
        ("EFG", 456),
    ],
    struct_type,
)
struct_array.cast(sub_struct_type)

Gives you:

    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 527, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 337, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<field1: string, field2: int32> to struct using function cast_struct

So one option would be to support this type of cast. |
Joris Van den Bossche / @jorisvandenbossche: |
David Li / @lidavidm: |
Joris Van den Bossche / @jorisvandenbossche:
|
Will Ayd / @WillAyd: |
Weston Pace / @westonpace: I don't know about reordering but it might be needed for Substrait to support their emit property which I think can arbitrarily reorder columns, both at the batch level and any nested level in a struct. I'm not sure what the rationale is for the "safe" flag. Are you saying it might be nice for users to say "do this cast if it can be done zero-copy but fail otherwise"? |
Dhruv Vats / @dhruv9vats: |
David Li / @lidavidm: We can tackle the unambiguous cases first, and work on the ambiguous cases later. For instance, subsetting fields without changing order should be reasonable. We can later add a field to also allow reordering, and to handle various ambiguous cases that Will raised. IMO, "safe" isn't about copying (all kernels copy, basically, though it would be good to optimize out copies for the struct fields if there's no type conversion), but is about whether the cast may produce invalid data or not, and whether the kernel should error or not. That isn't a concern here, it'll be passed down to the casts for the child fields. |
Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but I can't find one right away.
Assume you have a struct array with some fields:
We have a kernel to select a single child field:
But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with struct_field, and doing this manually is a bit cumbersome:
(this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
One option could be to expand the existing struct_field to allow selecting multiple fields (although that probably gets ambiguous/confusing with how you currently select a recursively nested field -> [0, 1] currently means "first child, second subchild" and not "first and second child").
Or a new kernel like "struct_subset" or some other name.
This might also overlap with general projection functionality? (cc @westonpace)
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Dhruv Vats / @dhruv9vats
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-15643. Please see the migration documentation for further details.