Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Kernel to select subset of fields of a StructArray #31101

Closed
asfimport opened this issue Feb 10, 2022 · 9 comments
Closed

[C++] Kernel to select subset of fields of a StructArray #31101

asfimport opened this issue Feb 10, 2022 · 9 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Feb 10, 2022

Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but don't directly find one.

Assume you have a struct array with some fields:

>>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
>>> arr.type
StructType(struct<a: int64, b: int64, c: int64>)

We have a kernel to select a single child field:

>>> pc.struct_field(arr, [0])
<pyarrow.lib.Int64Array object at 0x7ffa9e229940>
[
  1,
  2,
  3
]

But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with struct_field, and doing this manually is a bit cumbersome:

>>> fields = ['a', 'c']
>>> arrays = [arr.field(n) for n in fields]
>>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
>>> arr_subset.type
StructType(struct<a: int64, c: int64>)

(this is still OK, but if you had a ChunkedArray, it certainly gets annoying)

One option could be to expand the existing struct_field to allow selecting multiple fields (although that probably gets ambigous/confusing with how you currently select a recursively nested field -> [0, 1] currently means "first child, second subchild" and not "first and second child").
Or a new kernel like "struct_subset" or some other name.

This might also overlap with general projection functionality? (cc @westonpace)

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Dhruv Vats / @dhruv9vats

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-15643. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

&res / @0x26res:
Thanks for raising the issue.

I've noticed that you can't cast a struct array to a sub set of the struct. So for example:

import pyarrow as pa


struct_type = pa.struct(
    [pa.field("field1", pa.string()), pa.field("field2", pa.int32())]
)

sub_struct_type = pa.struct(
    [
        pa.field("field1", pa.string()),
    ]
)


struct_array = pa.array(
    [
        ("ABC", 123),
        ("EFG", 456),
    ],
    struct_type,
)

struct_array.cast(sub_struct_type)

Gives you:

    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 527, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 337, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<field1: string, field2: int32> to struct using function cast_struct

So one option would be to support this type of cast.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
In general we don't yet have implemented much casting support for structs (eg ARROW-1888 to cast the fields to a different type, which is currently being worked on. But I suppose that PR currently allows the cast only for the same field names and number of fields, i.e. only changing the type of the field).
But indeed, that would also be a way to support this functionality. I think such a cast would be useful to allow in any case. But I also might not directly think about using casting if I am looking to do a field selection (eg a Table and RecordBatch have a select method, and RecordBatch and StructArray are quite similar, so we could also have a StructArray.select method)

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
I think we can implement it as a cast, but have a utility method StructArray.select. We can extend ARROW-1888 indeed.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
ARROW-1888 has been merged now, and currently the cast is "strict", meaning that it requires the exact same number of fields with the same names in the same order. If we want to support this issue through a cast, this could be relaxed to:

  • allowing the fields of the target type to be a subset of the existing fields (but so no field names that are not present in the original array? Or also allow that in which case that field gets filled with nulls)

  • also allowing them to be in a different order?

    One thing I am wondering though, is whether we should consider this as a "safe" cast, or if we should add a new flag to the CastOptions to allow changing the fields of a struct

@asfimport
Copy link
Collaborator Author

Will Ayd / @WillAyd:
It feels right to me to allow the target type to subset the originating type. I'm not yet sure about the different order aspect. There is definitely some ambiguity that arises when the order changes. If you have an originating type of "x, y, z" and a target type of "z, y, x" I don't think its very clear if the alignment should be by name or by position - perhaps that is where the flag you are thinking about comes into play?

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
We have a concrete use case for the "subset" version in scanning. Users can specify nested refs which can be satisfied in the parquet reader but not the CSV reader. So for the CSV case we need to be able to read in the full column and then cast down to the targetted struct the user is asking for in the nested ref.

I don't know about reordering but it might be needed for Substrait to support their emit property which I think can arbitrarily reorder columns, both at the batch level and any nested level in a struct.

I'm not sure what the rationale is for the "safe" flag. Are you saying it might be nice for users to say "do this cast if it can be done zero-copy but fail otherwise"?

@asfimport
Copy link
Collaborator Author

Dhruv Vats / @dhruv9vats:
If this has some priority, I'd be happy to work on this. But I'm not clear on what path are we settling on. Do we implement struct_subset kernel and the casting functionality differently? Or should it be like implement one and use that to implement the other?

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
I believe this should be a cast. For one, that means it will automatically make scanning better!

We can tackle the unambiguous cases first, and work on the ambiguous cases later. For instance, subsetting fields without changing order should be reasonable. We can later add a field to also allow reordering, and to handle various ambiguous cases that Will raised.

IMO, "safe" isn't about copying (all kernels copy, basically, though it would be good to optimize out copies for the struct fields if there's no type conversion), but is about whether the cast may produce invalid data or not, and whether the kernel should error or not. That isn't a concern here, it'll be passed down to the casts for the child fields.

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Issue resolved by pull request 12724
#12724

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant