Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pyarrow dataset scan can't handle schema evolved nested structs #1610

Closed
Tom-Newton opened this issue Aug 30, 2023 · 12 comments
Closed

Pyarrow dataset scan can't handle schema evolved nested structs #1610

Tom-Newton opened this issue Aug 30, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Aug 30, 2023

Environment

Delta-rs version: 0.10.1

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Ubuntu 20.04
  • Other: pyarrow 13.0.0

Bug

What happened:
Exception is raised when trying to read a delta table that has undergone a schema evolution in a nested column. The problem is that the underlying parquet file is missing a column that now exists in the schema according to the delta transaction log.

Traceback (most recent call last):                                              
  File "minimal_reproduce.py", line 36, in <module>
    pyarrow_table = deltalake.DeltaTable(temp_path).to_pyarrow_table()
  File "/home/tomnewton/.local/lib/python3.8/site-packages/deltalake/table.py", line 529, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: string> output fields: struct<sub_column0: string, sub_column1: string>

What you expected to happen:
Read the table without error. The new column that is not actually present in the underlying parquet files should be filled with nulls.

How to reproduce it:
I wrote a python script that reproduces it. Essentially this just writes to a delta table, makes a second write with schema evolution in a sub column then tries to read the resulting table. (zipped it because github doesn't allow uploading .py files)
minimal_reproduce.zip

More details:
This might need to be fixed in pyarrow but this slightly unusual usecase is motivated by delta-rs.

@Tom-Newton Tom-Newton added the bug Something isn't working label Aug 30, 2023
@r3stl355
Copy link
Contributor

r3stl355 commented Oct 6, 2023

This is effectively a schema evolution when column data type is changing, which is not currently supported by Delta, right?

Error is thrown in the pyarrow kernel, as it expects all the input struct fields to be present in the output which is not possible of some rows have missing struct fields.

https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L368

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Oct 6, 2023

@r3stl355 I'm pretty sure this kind of schema evolution is supported by delta. If you try my reproduction script which writes the delta table using spark it allows this schema evolution and it can also reads it back correctly. Only when trying to read with delta-rs + pyarrow does it break.

@r3stl355
Copy link
Contributor

r3stl355 commented Oct 6, 2023

You are right @Tom-Newton, my bad, I should've checked the other implementation before saying above.

I think the problem is still with the way pyarrow work, not necessarily delta-rs (unless I missing something else). There is this test with TODO - may be related: https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/dataset/scanner_test.cc#L2552

CastStruct (which I presume is throwing the error) is processing input in batches (I assume batch == file in this case) so it reads the input struct fields from the batch and matches it against some output struct which, as I see it, is the same for all the batches, so you have 2 options to specifify that ouput struct - it's either a union of all the fields (as it seems to be the case right now) and throws an error or it could be an intersection of fields in which case you would end up with subset of nested columns (or no data in your case).

@Tom-Newton
Copy link
Contributor Author

I agree the problem may need to be resolved inside pyarrow but I think there is a chance it could be worked around in delta-rs. Regardless, I chose to create the issue here because delta-rs motivates this currently unsupported use-case and I thought it was more likely to be solved if I raised it here.

pyarrow is fine with adding entirely new columns through schema evolution just not sub fields of struct columns. That TODO you mention seems relevant to me.

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Oct 6, 2023

Actually it looks like that TODO references a completed ticket apache/arrow#31101. The example on that issue actually goes the opposite direction to what I want to resolve this problem. Also it looks like a new issue has been opened asking to implement the same functionality for pyarrow apache/arrow#35408. I'll need to investigate a bit more when I get a chance.

@r3stl355
Copy link
Contributor

r3stl355 commented Oct 6, 2023

Yeah, those may not be exactly the same as your issue. In addition to this, looks like Rust version is also suffering from the same as it's relying on Arrow DataFusion:

export TABLE_URI="<you table path here>"

cargo run --example load_table --features="datafusion"

results in Error during planning: Cannot cast file schema field nested_column of type Struct([Field { name: \"sub_column0\.. }]) to table schema field of type Struct([Field { name: \"sub_column1\"....

@leoeareis
Copy link

leoeareis commented Mar 18, 2024

Hey @r3stl355 and @Tom-Newton! I faced the same problem right now. Do you have any updates about that?

@Tom-Newton
Copy link
Contributor Author

I'm not aware of any progress on this. It's quite a rare occurrence for us and we've been working around it by re-writing relevant parts of tables, when we have problems.

@ion-elgreco ion-elgreco changed the title Fix reading tables which have undergone schema evolution in nested columns Pyarrow dataset scan can't handle schema evolved nested structs Aug 19, 2024
@ion-elgreco
Copy link
Collaborator

@Tom-Newton did you create an issue in the arrow repo?

@Tom-Newton
Copy link
Contributor Author

@Tom-Newton did you create an issue in the arrow repo?

I did not. I never really explored whether it would be possible to fix in delta-rs.

@ion-elgreco
Copy link
Collaborator

@Tom-Newton did you create an issue in the arrow repo?

I did not. I never really explored whether it would be possible to fix in delta-rs.

I don't believe there is much we can do since the issue lies in the pyarrow scan function.

lidavidm added a commit to apache/arrow that referenced this issue Nov 25, 2024
…ct (#44587)

### Rationale for this change
Sometimes its useful to add a column full of nulls in a cast. Currently this is supported in top level columns but not inside structs. Example where this is important: delta-io/delta-rs#1610

### What changes are included in this PR?
Add support for filling in columns full of null for nullable struct fields. 
I've gone for a fairly minimal change that achieves what I needed but I wonder if there should be a more significant change so that this casting is done by field name and ignore the field order. 

### Are these changes tested?
Yes. The expected behaviour in several existing tests has been altered and a couple of new tests have been added.

I also rolled out a custom build with this change internally because it suddenly became a critical problem. 

### Are there any user-facing changes?
Yes. There are scenarios that previously failed with `struct fields don't match or are in the wrong order` but now succeed after filling in `null`s.

* GitHub Issue: #44555

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
@Tom-Newton
Copy link
Contributor Author

Going to close this because apache/arrow#44587 merged. It missed the 18.1.0 release though so, I guess we will need to wait for 19.0.0 for it to be included in an official release (personally I'm using a custom build of pyarrow which includes this fix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants