-
Notifications
You must be signed in to change notification settings - Fork 417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pyarrow dataset scan can't handle schema evolved nested structs #1610
Comments
This is effectively a schema evolution when column data type is changing, which is not currently supported by Delta, right? Error is thrown in the pyarrow kernel, as it expects all the input struct fields to be present in the output which is not possible of some rows have missing struct fields. |
@r3stl355 I'm pretty sure this kind of schema evolution is supported by delta. If you try my reproduction script which writes the delta table using spark it allows this schema evolution and it can also reads it back correctly. Only when trying to read with delta-rs + pyarrow does it break. |
You are right @Tom-Newton, my bad, I should've checked the other implementation before saying above. I think the problem is still with the way pyarrow work, not necessarily delta-rs (unless I missing something else). There is this test with TODO - may be related: https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/dataset/scanner_test.cc#L2552 CastStruct (which I presume is throwing the error) is processing input in batches (I assume batch == file in this case) so it reads the input struct fields from the batch and matches it against some output struct which, as I see it, is the same for all the batches, so you have 2 options to specifify that ouput struct - it's either a union of all the fields (as it seems to be the case right now) and throws an error or it could be an intersection of fields in which case you would end up with subset of nested columns (or no data in your case). |
I agree the problem may need to be resolved inside
|
Actually it looks like that TODO references a completed ticket apache/arrow#31101. The example on that issue actually goes the opposite direction to what I want to resolve this problem. Also it looks like a new issue has been opened asking to implement the same functionality for pyarrow apache/arrow#35408. I'll need to investigate a bit more when I get a chance. |
Yeah, those may not be exactly the same as your issue. In addition to this, looks like Rust version is also suffering from the same as it's relying on Arrow DataFusion:
results in |
Hey @r3stl355 and @Tom-Newton! I faced the same problem right now. Do you have any updates about that? |
I'm not aware of any progress on this. It's quite a rare occurrence for us and we've been working around it by re-writing relevant parts of tables, when we have problems. |
@Tom-Newton did you create an issue in the arrow repo? |
I did not. I never really explored whether it would be possible to fix in |
I don't believe there is much we can do since the issue lies in the pyarrow scan function. |
…ct (#44587) ### Rationale for this change Sometimes its useful to add a column full of nulls in a cast. Currently this is supported in top level columns but not inside structs. Example where this is important: delta-io/delta-rs#1610 ### What changes are included in this PR? Add support for filling in columns full of null for nullable struct fields. I've gone for a fairly minimal change that achieves what I needed but I wonder if there should be a more significant change so that this casting is done by field name and ignore the field order. ### Are these changes tested? Yes. The expected behaviour in several existing tests has been altered and a couple of new tests have been added. I also rolled out a custom build with this change internally because it suddenly became a critical problem. ### Are there any user-facing changes? Yes. There are scenarios that previously failed with `struct fields don't match or are in the wrong order` but now succeed after filling in `null`s. * GitHub Issue: #44555 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
Going to close this because apache/arrow#44587 merged. It missed the 18.1.0 release though so, I guess we will need to wait for 19.0.0 for it to be included in an official release (personally I'm using a custom build of |
Environment
Delta-rs version: 0.10.1
Binding: Python
Environment:
Bug
What happened:
Exception is raised when trying to read a delta table that has undergone a schema evolution in a nested column. The problem is that the underlying parquet file is missing a column that now exists in the schema according to the delta transaction log.
What you expected to happen:
Read the table without error. The new column that is not actually present in the underlying parquet files should be filled with nulls.
How to reproduce it:
I wrote a python script that reproduces it. Essentially this just writes to a delta table, makes a second write with schema evolution in a sub column then tries to read the resulting table. (zipped it because github doesn't allow uploading
.py
files)minimal_reproduce.zip
More details:
This might need to be fixed in
pyarrow
but this slightly unusual usecase is motivated by delta-rs.The text was updated successfully, but these errors were encountered: