[C++][Compute] Support casting struct fields to a different order #45028

Tom-Newton · 2024-12-14T14:32:10Z

Describe the enhancement requested

Arrow can already cast to a different order columns but it can't do the same on struct fields.

I previously completed a related issue #44555 where it was agreed

in general if it works on tables but not structs (or vice versa) we should make them both work.

Therefore by the same logic I think adding support for re-ordering columns is also reasonable.

Again my personal motivation is related to https://github.com/delta-io/delta-rs/. We want to be able to read any table and Spark sometimes writes parquet files with different physical schema field order compared to the order in the table's schema.

Component(s)

C++

Tom-Newton · 2024-12-16T15:01:28Z

To clarify a bit what I'm talking about

This works

import pyarrow as pa
import pyarrow.parquet as pq

column0_array = pa.array([0, 1], type=pa.int32())
column1_array = pa.array([10, 11], type=pa.int32())
table = pa.Table.from_arrays(
    [column0_array, column1_array], names=["column0", "column1"]
)
print(table)

pq.write_table(table, "example.parquet")

table_read = pq.read_table(
    "example.parquet", schema=pa.schema(list(reversed(table.schema)))
)
print(table_read)

But this does not

import pyarrow as pa
import pyarrow.parquet as pq

nested_column_array = pa.array(
    [{"sub_column0": 0, "sub_column1": 10}, {"sub_column0": 1, "sub_column1": 11}],
    type=pa.struct(
        [pa.field("sub_column0", pa.int32()), pa.field("sub_column1", pa.int32())]
    ),
)
struct_table = pa.Table.from_arrays([nested_column_array], names=["nested_column"])
print(struct_table)

pq.write_table(struct_table, "example2.parquet")

table_read = pq.read_table(
    "example2.parquet",
    schema=pa.schema(
        [
            (
                struct_table.schema[0].name,
                pa.struct(
                    [
                        struct_table.schema[0].type.field(1),
                        struct_table.schema[0].type.field(0),
                    ],
                ),
            )
        ]
    ),
)
print(table_read)

Traceback (most recent call last):
  File "/home/tomnewton/arrow/cpp/src/arrow/compute/example.py", line 30, in <module>
    table_read = pq.read_table(
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: int32, sub_column1: int32> output fields: struct<sub_column1: int32, sub_column0: int32>

I'm hoping to make both work.

Tom-Newton added the Type: enhancement label Dec 14, 2024

github-actions bot added the Component: C++ label Dec 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Compute] Support casting struct fields to a different order #45028

[C++][Compute] Support casting struct fields to a different order #45028

Tom-Newton commented Dec 14, 2024 •

edited

Loading

Tom-Newton commented Dec 16, 2024

[C++][Compute] Support casting struct fields to a different order #45028

[C++][Compute] Support casting struct fields to a different order #45028

Comments

Tom-Newton commented Dec 14, 2024 • edited Loading

Describe the enhancement requested

Component(s)

Tom-Newton commented Dec 16, 2024

Tom-Newton commented Dec 14, 2024 •

edited

Loading