Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Compute] Support casting struct fields to a different order #45028

Open
Tom-Newton opened this issue Dec 14, 2024 · 1 comment
Open

[C++][Compute] Support casting struct fields to a different order #45028

Tom-Newton opened this issue Dec 14, 2024 · 1 comment

Comments

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Dec 14, 2024

Describe the enhancement requested

Arrow can already cast to a different order columns but it can't do the same on struct fields.

I previously completed a related issue #44555 where it was agreed

in general if it works on tables but not structs (or vice versa) we should make them both work.

#44555 (comment)

Therefore by the same logic I think adding support for re-ordering columns is also reasonable.

Again my personal motivation is related to https://github.com/delta-io/delta-rs/. We want to be able to read any table and Spark sometimes writes parquet files with different physical schema field order compared to the order in the table's schema.

Component(s)

C++

@Tom-Newton
Copy link
Contributor Author

To clarify a bit what I'm talking about

This works

import pyarrow as pa
import pyarrow.parquet as pq

column0_array = pa.array([0, 1], type=pa.int32())
column1_array = pa.array([10, 11], type=pa.int32())
table = pa.Table.from_arrays(
    [column0_array, column1_array], names=["column0", "column1"]
)
print(table)

pq.write_table(table, "example.parquet")

table_read = pq.read_table(
    "example.parquet", schema=pa.schema(list(reversed(table.schema)))
)
print(table_read)

But this does not

import pyarrow as pa
import pyarrow.parquet as pq

nested_column_array = pa.array(
    [{"sub_column0": 0, "sub_column1": 10}, {"sub_column0": 1, "sub_column1": 11}],
    type=pa.struct(
        [pa.field("sub_column0", pa.int32()), pa.field("sub_column1", pa.int32())]
    ),
)
struct_table = pa.Table.from_arrays([nested_column_array], names=["nested_column"])
print(struct_table)

pq.write_table(struct_table, "example2.parquet")

table_read = pq.read_table(
    "example2.parquet",
    schema=pa.schema(
        [
            (
                struct_table.schema[0].name,
                pa.struct(
                    [
                        struct_table.schema[0].type.field(1),
                        struct_table.schema[0].type.field(0),
                    ],
                ),
            )
        ]
    ),
)
print(table_read)
Traceback (most recent call last):
  File "/home/tomnewton/arrow/cpp/src/arrow/compute/example.py", line 30, in <module>
    table_read = pq.read_table(
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: int32, sub_column1: int32> output fields: struct<sub_column1: int32, sub_column0: int32>

I'm hoping to make both work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant