GH-45028: [C++][Compute] Allow cast to reorder struct fields #45246

Open. Tom-Newton wants to merge 15 commits into main.

@Tom-Newton Tom-Newton commented Jan 13, 2025

Rationale for this change

When reading a Parquet dataset whose physical schema has an inconsistent order for top-level columns, Arrow can still read the table. However, it cannot handle the same kind of inconsistency in the order of struct fields, and raises errors like:

Traceback (most recent call last):
  File "/home/tomnewton/arrow/cpp/src/arrow/compute/example.py", line 30, in <module>
    table_read = pq.read_table(
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: int32, sub_column1: int32> output fields: struct<sub_column1: int32, sub_column0: int32>

This issue is closely related to #44555.

What changes are included in this PR?

Change the implementation of CastStruct::Exec to match fields primarily by name rather than by position. Each input field can still be used only once, and if several input fields share the same name they are consumed in the order in which they appear in the input.
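To illustrate the matching rule described above, here is a minimal pure-Python sketch. It is not the C++ code in this PR: `match_fields_by_name` is a hypothetical helper that works on plain lists of field names, and it simply reports failure when an output name has no unused input field, which may differ from how the real kernel treats missing fields.

```python
from typing import List, Optional


def match_fields_by_name(
    input_names: List[str], output_names: List[str]
) -> Optional[List[int]]:
    """For each output field, return the index of the input field feeding it.

    Hypothetical helper for illustration only. Returns None when some
    output field has no unused input field with the same name.
    """
    used = [False] * len(input_names)
    mapping = []
    for out_name in output_names:
        # Take the first not-yet-used input field with this name, so that
        # duplicate names are consumed in input order.
        for i, in_name in enumerate(input_names):
            if not used[i] and in_name == out_name:
                used[i] = True
                mapping.append(i)
                break
        else:
            # No unused input field carries this name.
            return None
    return mapping


# Reordered fields, as in the traceback above.
reordered = match_fields_by_name(
    ["sub_column0", "sub_column1"], ["sub_column1", "sub_column0"]
)
# Duplicate names: the two "a" fields are used in input order.
duplicates = match_fields_by_name(["a", "b", "a"], ["a", "a", "b"])
# Each input can be used only once, so a second "a" cannot be satisfied.
exhausted = match_fields_by_name(["a"], ["a", "a"])
```

Because each input index is marked as used once it is matched, a cast whose input and output have the same field order produces the identity mapping, which is why use cases that rely on column order should keep working.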

Alternatives I considered:

Implement this behaviour in the same place as the equivalent logic for top-level columns, at

ARROW_ASSIGN_OR_RAISE(auto column, field_ref.GetOneOrNone(partial_batch));

This would affect Parquet scans without modifying cast behaviour. I decided against this because I want the behaviour to work recursively, e.g. for nested structs or structs inside arrays of maps.

Have a config option to switch between field-name-based and field-order-based matching. This would make things more explicit, but there would be two code paths to maintain instead of one. IMO the logic I've implemented, where each input can be used only once and column order is maintained for duplicate names, achieves what I want without breaking any use cases that rely on column order and without too much complexity. So I decided a config option was not necessary.
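The point about recursion can be sketched with a toy model. This is not Arrow's type system: the tuple encoding of types and the `cast_value` function are assumptions made purely for illustration. A type is either a string such as "int32", `("list", child_type)`, or `("struct", [(name, type), ...])`, and a struct value is a tuple in field order.

```python
def cast_value(value, in_type, out_type):
    """Rearrange `value` from the layout of in_type to that of out_type.

    Toy model for illustration only. Struct fields are matched by name,
    each input field is used at most once, and the rule is applied at
    every nesting level.
    """
    if isinstance(out_type, str):
        # Primitive types must match exactly in this sketch.
        if in_type != out_type:
            raise TypeError(f"cannot cast {in_type!r} to {out_type!r}")
        return value
    kind, out_child = out_type
    if not isinstance(in_type, tuple) or in_type[0] != kind:
        raise TypeError(f"cannot cast {in_type!r} to {out_type!r}")
    in_child = in_type[1]
    if kind == "list":
        # Recurse into every list element.
        return [cast_value(v, in_child, out_child) for v in value]
    if kind == "struct":
        used = [False] * len(in_child)
        result = []
        for out_name, out_field_type in out_child:
            for i, (in_name, in_field_type) in enumerate(in_child):
                if not used[i] and in_name == out_name:
                    used[i] = True
                    # Recurse so nested structs are reordered too.
                    result.append(
                        cast_value(value[i], in_field_type, out_field_type)
                    )
                    break
            else:
                raise TypeError(f"no unused input field named {out_name!r}")
        return tuple(result)
    raise TypeError(f"unknown kind {kind!r}")


inner_in = ("struct", [("a", "int32"), ("b", "int32")])
inner_out = ("struct", [("b", "int32"), ("a", "int32")])
in_type = ("list", ("struct", [("x", "int32"), ("inner", inner_in)]))
out_type = ("list", ("struct", [("inner", inner_out), ("x", "int32")]))

rows = [(1, (10, 20)), (2, (30, 40))]
converted = cast_value(rows, in_type, out_type)
```

Doing the matching inside the cast itself means the same rule applies at every level, whereas handling it only where top-level columns are resolved would stop at the first level.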

Are these changes tested?

Yes. A few new assertions were added, but mostly it was a matter of adjusting the expected behaviour of existing tests.

Are there any user-facing changes?

Yes. Casts that require changing the struct field order will now succeed without error.

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@Tom-Newton Tom-Newton changed the title from "Tomnewton/cast can reorder struct fields/gh 45028" to "GH-45028: Allow cast to reorder struct fields" on Jan 14, 2025

⚠️ GitHub issue #45028 has been automatically assigned in GitHub to PR creator.

@Tom-Newton Tom-Newton changed the title from "GH-45028: Allow cast to reorder struct fields" to "WIP GH-45028: [C++][Compute] Allow cast to reorder struct fields" on Jan 14, 2025
@Tom-Newton Tom-Newton force-pushed the tomnewton/cast_can_reorder_struct_fields/GH-45028 branch from 7a3a569 to c9a10fd on January 14, 2025 14:24
@Tom-Newton Tom-Newton force-pushed the tomnewton/cast_can_reorder_struct_fields/GH-45028 branch from 31a3e21 to 9b948ac on January 14, 2025 15:33
@Tom-Newton Tom-Newton changed the title from "WIP GH-45028: [C++][Compute] Allow cast to reorder struct fields" to "GH-45028: [C++][Compute] Allow cast to reorder struct fields" on Jan 14, 2025
@Tom-Newton Tom-Newton (Contributor Author) commented

I think I'm happy with my implementation, so I'm going to mark this ready for review. I understand that the concept may be slightly controversial, though, so I'm happy to discuss if anyone disagrees.

@Tom-Newton Tom-Newton marked this pull request as ready for review January 14, 2025 19:24