Pyarrow dataset scan can't handle schema evolved nested structs #1610

Tom-Newton · 2023-08-30T12:13:22Z

Environment

Delta-rs version: 0.10.1

Binding: Python

Environment:

Cloud provider: Azure
OS: Ubuntu 20.04
Other: pyarrow 13.0.0

Bug

What happened:
Exception is raised when trying to read a delta table that has undergone a schema evolution in a nested column. The problem is that the underlying parquet file is missing a column that now exists in the schema according to the delta transaction log.

Traceback (most recent call last):                                              
  File "minimal_reproduce.py", line 36, in <module>
    pyarrow_table = deltalake.DeltaTable(temp_path).to_pyarrow_table()
  File "/home/tomnewton/.local/lib/python3.8/site-packages/deltalake/table.py", line 529, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: string> output fields: struct<sub_column0: string, sub_column1: string>

What you expected to happen:
Read the table without error. The new column that is not actually present in the underlying parquet files should be filled with nulls.

How to reproduce it:
I wrote a python script that reproduces it. Essentially this just writes to a delta table, makes a second write with schema evolution in a sub column then tries to read the resulting table. (zipped it because github doesn't allow uploading .py files)
minimal_reproduce.zip

More details:
This might need to be fixed in pyarrow but this slightly unusual usecase is motivated by delta-rs.

The text was updated successfully, but these errors were encountered:

r3stl355 · 2023-10-06T15:43:56Z

This is effectively a schema evolution when column data type is changing, which is not currently supported by Delta, right?

Error is thrown in the pyarrow kernel, as it expects all the input struct fields to be present in the output which is not possible of some rows have missing struct fields.

https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L368

Tom-Newton · 2023-10-06T15:50:48Z

@r3stl355 I'm pretty sure this kind of schema evolution is supported by delta. If you try my reproduction script which writes the delta table using spark it allows this schema evolution and it can also reads it back correctly. Only when trying to read with delta-rs + pyarrow does it break.

r3stl355 · 2023-10-06T16:28:48Z

You are right @Tom-Newton, my bad, I should've checked the other implementation before saying above.

I think the problem is still with the way pyarrow work, not necessarily delta-rs (unless I missing something else). There is this test with TODO - may be related: https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/dataset/scanner_test.cc#L2552

CastStruct (which I presume is throwing the error) is processing input in batches (I assume batch == file in this case) so it reads the input struct fields from the batch and matches it against some output struct which, as I see it, is the same for all the batches, so you have 2 options to specifify that ouput struct - it's either a union of all the fields (as it seems to be the case right now) and throws an error or it could be an intersection of fields in which case you would end up with subset of nested columns (or no data in your case).

Tom-Newton · 2023-10-06T16:39:36Z

I agree the problem may need to be resolved inside pyarrow but I think there is a chance it could be worked around in delta-rs. Regardless, I chose to create the issue here because delta-rs motivates this currently unsupported use-case and I thought it was more likely to be solved if I raised it here.

pyarrow is fine with adding entirely new columns through schema evolution just not sub fields of struct columns. That TODO you mention seems relevant to me.

Tom-Newton · 2023-10-06T16:48:48Z

Actually it looks like that TODO references a completed ticket apache/arrow#31101. The example on that issue actually goes the opposite direction to what I want to resolve this problem. Also it looks like a new issue has been opened asking to implement the same functionality for pyarrow apache/arrow#35408. I'll need to investigate a bit more when I get a chance.

r3stl355 · 2023-10-06T18:03:50Z

Yeah, those may not be exactly the same as your issue. In addition to this, looks like Rust version is also suffering from the same as it's relying on Arrow DataFusion:

export TABLE_URI="<you table path here>"

cargo run --example load_table --features="datafusion"

results in Error during planning: Cannot cast file schema field nested_column of type Struct([Field { name: \"sub_column0\.. }]) to table schema field of type Struct([Field { name: \"sub_column1\"....

leoeareis · 2024-03-18T12:41:14Z

Hey @r3stl355 and @Tom-Newton! I faced the same problem right now. Do you have any updates about that?

Tom-Newton · 2024-03-19T01:07:16Z

I'm not aware of any progress on this. It's quite a rare occurrence for us and we've been working around it by re-writing relevant parts of tables, when we have problems.

ion-elgreco · 2024-08-19T20:03:32Z

@Tom-Newton did you create an issue in the arrow repo?

Tom-Newton · 2024-08-19T20:06:44Z

@Tom-Newton did you create an issue in the arrow repo?

I did not. I never really explored whether it would be possible to fix in delta-rs.

ion-elgreco · 2024-08-19T20:16:26Z

@Tom-Newton did you create an issue in the arrow repo?

I did not. I never really explored whether it would be possible to fix in delta-rs.

I don't believe there is much we can do since the issue lies in the pyarrow scan function.

…ct (#44587) ### Rationale for this change Sometimes its useful to add a column full of nulls in a cast. Currently this is supported in top level columns but not inside structs. Example where this is important: delta-io/delta-rs#1610 ### What changes are included in this PR? Add support for filling in columns full of null for nullable struct fields. I've gone for a fairly minimal change that achieves what I needed but I wonder if there should be a more significant change so that this casting is done by field name and ignore the field order. ### Are these changes tested? Yes. The expected behaviour in several existing tests has been altered and a couple of new tests have been added. I also rolled out a custom build with this change internally because it suddenly became a critical problem. ### Are there any user-facing changes? Yes. There are scenarios that previously failed with `struct fields don't match or are in the wrong order` but now succeed after filling in `null`s. * GitHub Issue: #44555 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>

Tom-Newton · 2024-11-26T21:27:57Z

Going to close this because apache/arrow#44587 merged. It missed the 18.1.0 release though so, I guess we will need to wait for 19.0.0 for it to be included in an official release (personally I'm using a custom build of pyarrow which includes this fix).

Tom-Newton added the bug Something isn't working label Aug 30, 2023

ion-elgreco changed the title ~~Fix reading tables which have undergone schema evolution in nested columns~~ Pyarrow dataset scan can't handle schema evolved nested structs Aug 19, 2024

This was referenced Oct 28, 2024

[C++][Compute] Support casting a smaller struct to a bigger one if extra fields are nullable apache/arrow#44555

Closed

GH-44555: [C++][Compute] Allow casting struct to bigger nullable struct apache/arrow#44587

Merged

Tom-Newton closed this as completed Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pyarrow dataset scan can't handle schema evolved nested structs #1610

Pyarrow dataset scan can't handle schema evolved nested structs #1610

Tom-Newton commented Aug 30, 2023 •

edited

Loading

r3stl355 commented Oct 6, 2023

Tom-Newton commented Oct 6, 2023 •

edited

Loading

r3stl355 commented Oct 6, 2023 •

edited

Loading

Tom-Newton commented Oct 6, 2023

Tom-Newton commented Oct 6, 2023 •

edited

Loading

r3stl355 commented Oct 6, 2023

leoeareis commented Mar 18, 2024 •

edited

Loading

Tom-Newton commented Mar 19, 2024

ion-elgreco commented Aug 19, 2024

Tom-Newton commented Aug 19, 2024

ion-elgreco commented Aug 19, 2024

Tom-Newton commented Nov 26, 2024

Pyarrow dataset scan can't handle schema evolved nested structs #1610

Pyarrow dataset scan can't handle schema evolved nested structs #1610

Comments

Tom-Newton commented Aug 30, 2023 • edited Loading

Environment

Bug

r3stl355 commented Oct 6, 2023

Tom-Newton commented Oct 6, 2023 • edited Loading

r3stl355 commented Oct 6, 2023 • edited Loading

Tom-Newton commented Oct 6, 2023

Tom-Newton commented Oct 6, 2023 • edited Loading

r3stl355 commented Oct 6, 2023

leoeareis commented Mar 18, 2024 • edited Loading

Tom-Newton commented Mar 19, 2024

ion-elgreco commented Aug 19, 2024

Tom-Newton commented Aug 19, 2024

ion-elgreco commented Aug 19, 2024

Tom-Newton commented Nov 26, 2024

Tom-Newton commented Aug 30, 2023 •

edited

Loading

Tom-Newton commented Oct 6, 2023 •

edited

Loading

r3stl355 commented Oct 6, 2023 •

edited

Loading

Tom-Newton commented Oct 6, 2023 •

edited

Loading

leoeareis commented Mar 18, 2024 •

edited

Loading