
Cannot merge to a table with a timestamp column after upgrading delta-rs #2478

Closed · echai58 opened this issue May 3, 2024 · 4 comments
Labels: binding/python (Issues for the Python package), bug (Something isn't working)


echai58 commented May 3, 2024

Environment

Delta-rs version: 0.15.3 => 0.17.2

Binding: python


Bug

What happened:
I have a lot of tables created with an older version of delta-rs, around 0.15.3. I'm looking to upgrade, but I'm running into breaking errors, which I believe stem from the 0.16.0 breaking change to how timestamps are handled.

I have tables with a timestamp column that, due to the older delta-rs versions, was written without a timezone. After upgrading to 0.17.2, reading the table reports the schema as timestamp[us, tz=UTC]. Okay, that's reasonable: it's attaching a UTC timezone.
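
For reference, the reported schema can be inspected like this (a small sketch; assumes the deltalake Python bindings and the `test` table path used in the repro below):

```python
from deltalake import DeltaTable

dt = DeltaTable("test")
# After 0.16, field "a" reports as timestamp[us, tz=UTC] even though the
# parquet files on disk store a timezone-naive timestamp.
print(dt.schema().to_pyarrow())
```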

But when I try to write to the table via a merge whose predicate compares the timestamp columns, it fails with:

```
DeltaError: Generic DeltaTable error: External error: Arrow error: Invalid argument error: Invalid comparison operation: Timestamp(Microsecond, Some("UTC")) == Timestamp(Microsecond, None)
```

My theory is that delta-rs interprets the table schema as having a UTC timezone, whereas the physical data is actually timezone-naive. So when it casts the source table I pass in, it applies the UTC timezone, and Arrow then cannot compare that against the physical data, which has no timezone.
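
To make the mismatch concrete, here is a minimal standalone pyarrow sketch (illustration only, not delta-rs internals). The underlying values are identical between the two types; only the timezone annotation differs, which is why a cast on the scan side would reconcile them:

```python
import pyarrow as pa

# Physical data as the old delta-rs wrote it: timezone-naive microseconds.
naive = pa.array([1640995200000000], type=pa.timestamp("us"))
print(naive.type)  # timestamp[us]

# The logical type delta-rs >= 0.16 reports for the same column.
aware = naive.cast(pa.timestamp("us", tz="UTC"))
print(aware.type)  # timestamp[us, tz=UTC]

# Same underlying integers, so the cast is lossless.
assert naive.cast(pa.int64()).equals(aware.cast(pa.int64()))
```

arrow-rs, which delta-rs uses for the merge, refuses to compare the two types directly, hence the "Invalid comparison operation" error above.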

What you expected to happen:
Is there any way to upgrade delta-rs such that this doesn't break? It would be really unfortunate to have to rewrite all of these tables because of this breaking change.

How to reproduce it:
Run this with an older (< 0.16) delta-rs:

```python
import pandas as pd
import pyarrow as pa
from deltalake import write_deltalake

t = pa.Table.from_pandas(
    pd.DataFrame({"p": [1], "a": [pd.Timestamp("2022-01-01")], "b": [1]}),
    schema=pa.schema([("p", pa.int64()), ("a", pa.timestamp("us", tz="UTC")), ("b", pa.int64())]),
)

write_deltalake(
    "test",
    t,
    mode="overwrite",
    partition_by=["p"],
)
```

Then run this with a newer delta-rs (e.g. 0.17.2):

```python
import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable

new_data = pa.Table.from_pandas(
    pd.DataFrame({"p": [1], "a": [pd.Timestamp("2022-01-01")], "b": [4]}),
    schema=pa.schema([("p", pa.int64()), ("a", pa.timestamp("us", tz="UTC")), ("b", pa.int64())]),
)

dt = DeltaTable("test")  # the table written by the snippet above
dt.merge(
    source=new_data,
    predicate="s.p = t.p and s.a = t.a",
    source_alias="s",
    target_alias="t",
).when_matched_update_all().when_not_matched_insert_all().execute()
```
echai58 added the bug label on May 3, 2024
echai58 (Author) commented May 3, 2024

@ion-elgreco I'd be interested to hear your thoughts on this; it seems related to #2341.

ion-elgreco (Collaborator) commented May 3, 2024

@echai58 it was quite a breaking change.

Old tables were never written correctly.

Ideally we would cast the table scan to use the delta schema, but we have never added this. You could look into that, or just rewrite the tables.
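
A one-time rewrite is roughly the sketch below (assuming the table fits in memory; `test` and the `p` partition column come from the repro above, and whether the read path succeeds on a given version should be verified first):

```python
from deltalake import DeltaTable, write_deltalake

# One-time migration sketch: read the full table back and overwrite it so the
# rewritten parquet files carry the tz-aware schema delta-rs now reports.
dt = DeltaTable("test")
data = dt.to_pyarrow_table()
write_deltalake("test", data, mode="overwrite", partition_by=["p"])
```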

echai58 (Author) commented May 3, 2024

> @echai58 it was quite a breaking change.
>
> Old tables were never written correctly.
>
> Ideally we would cast the table scan to use the delta schema, but we have never added this. You could look into that, or just rewrite the tables.

@ion-elgreco Okay. It would be really difficult for us to rewrite all of our tables. Can you provide some pointers to the code locations we'd need to change to get the table scan to use the delta schema? I can look into that.

ion-elgreco (Collaborator) commented

> > @echai58 it was quite a breaking change.
> >
> > Old tables were never written correctly.
> >
> > Ideally we would cast the table scan to use the delta schema, but we have never added this. You could look into that, or just rewrite the tables.
>
> @ion-elgreco Okay. It would be really difficult for us to rewrite all of our tables. Can you provide some pointers to the code locations we'd need to change to get the table scan to use the delta schema? I can look into that.

Somewhere in the scan builder, I guess, and then in the parquet scan.

rtyler added the binding/python label on May 15, 2024
ion-elgreco pushed a commit that referenced this issue Jul 21, 2024
# Description

By casting the read record batch to the delta schema, DataFusion can read
tables where the underlying parquet files can be cast to the desired
schema. Fixes:

- Errors querying data where some parquet files are missing columns that
were added later through schema evolution, including nested columns of
structs inside maps, lists, or children of other structs
- Maps and lists written with different element names
- Timestamps of different units
- Any other cast supported by arrow-cast

This is possible now because DataFusion exposes a SchemaAdapter which can
be overridden.

Note that this means all timestamps read by delta-rs have microsecond
precision, matching the Delta protocol.
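
Conceptually, the per-batch adaptation amounts to the following pyarrow sketch (an illustration of the idea only; the actual fix is a Rust SchemaAdapter inside DataFusion's parquet scan, and `adapt_batch` is a hypothetical helper):

```python
import pyarrow as pa

def adapt_batch(batch: pa.RecordBatch, delta_schema: pa.Schema) -> pa.RecordBatch:
    # Cast whatever types the parquet file yields to the table's logical
    # (delta) schema, e.g. timestamp[us] -> timestamp[us, tz=UTC].
    return pa.Table.from_batches([batch]).cast(delta_schema).to_batches()[0]
```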

# Related Issue(s)
- This makes solving #2478 and #2341 just a matter of adding code to the
delta-rs cast logic.

---------

Co-authored-by: Alex Wilcoxson <[email protected]>