Schema issue when writing new delta tables - parquet schema not valid delta lake schema #9795
Comments
I ran into the same issue today; I made an upstream issue in the delta-rs repo: delta-io/delta-rs#1528
I thought about posting an issue in delta-rs as well, but I thought I saw some issues there about adding support for the large arrow types. I also felt like it is the duty of the application writing the data to ensure schema consistency on read/write. The delta transaction protocol doesn't distinguish between `Utf8` and `LargeUtf8`. I'm not familiar enough with the delta-rs implementation, but perhaps there is a solution in which delta-rs requires an explicit schema when translating a delta table to arrow format, so that the ambiguity is resolved by the caller.
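To make the ambiguity concrete, a small pyarrow sketch (my illustration, not from the comment above):

```python
import pyarrow as pa

# Two distinct arrow schemas for the same logical data. Delta has a
# single string type, so a round-trip through a Delta schema cannot
# preserve which of the two was originally written.
small = pa.schema([("this", pa.string())])        # Utf8
large = pa.schema([("this", pa.large_string())])  # LargeUtf8
assert small != large
```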
I have encountered the same issue. I wrote a delta table first in S3 with the following params:

```python
data_to_write.write_delta(
    target=s3_location,
    mode="error",
    storage_options={
        "AWS_REGION": self.region_name,
        "AWS_ACCESS_KEY_ID": self.boto_session.get_credentials().access_key,
        "AWS_SECRET_ACCESS_KEY": self.boto_session.get_credentials().secret_key,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    },
    overwrite_schema=True,
    delta_write_options={
        "partition_by": [
            "ingested_at_year",
            "ingested_at_month",
            "ingested_at_day",
            "ingested_at_hour",
        ],
        "name": "raw_events",
        "description": "Events loaded from source bucket",
    },
)
```

On the next run, it fails with the following error:

```
E ValueError: Schema of data does not match table schema
E Table schema:
E obj_key: large_string
E data: large_string
E ingested_at: timestamp[us, tz=UTC]
E ingested_at_year: int32
E ingested_at_month: uint32
E ingested_at_day: uint32
E ingested_at_hour: uint32
E ingested_at_minute: uint32
E ingested_at_second: uint32
E Data Schema:
E obj_key: string
E data: string
E ingested_at: timestamp[us]
E ingested_at_year: int32
E ingested_at_month: int32
E ingested_at_day: int32
E ingested_at_hour: int32
E ingested_at_minute: int32
E ingested_at_second: int32
```

I haven't found a solution so far.
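One possible workaround, sketched here as an assumption rather than a tested fix (it reuses `s3_location` and `data_to_write` from the snippet above): cast the new data to the schema already stored in the table before writing.

```python
from deltalake import DeltaTable

# Normalize the outgoing arrow data to the stored table schema, so the
# string/large_string, int32/uint32 and timestamp-timezone differences
# shown in the error above are cast away before the write.
# (storage_options for S3 credentials omitted for brevity.)
table_schema = DeltaTable(s3_location).schema().to_pyarrow()
arrow_data = data_to_write.to_arrow().cast(table_schema)
```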
@philszep I think you can close it here. It's going to be fixed upstream.
Same issue here, btw. Do you know when it will be fixed upstream?
@edgBR Actually, this is a different issue. Can you create one upstream? Then I will look at it; it's probably a trivial fix.
Ran into a similar issue implementing write support for Iceberg (#15018). Example to reproduce:

Dataframe schema:

Arrow schema:
@kevinjqliu I resolved it upstream in delta-rs, with the `large_dtypes` parameter.
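For reference, a sketch of that parameter in use (assuming the `deltalake` writer of that era; the path and data are illustrative):

```python
import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"this": pa.array(["x", "y"], type=pa.large_string())})

# large_dtypes=True tells the writer to treat large arrow types as
# equivalent to their small counterparts instead of rejecting them
# in the schema check.
write_deltalake("/tmp/events", data, mode="append", large_dtypes=True)
```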
Thanks @ion-elgreco, I'll take a look at Iceberg's schema handling.
@kevinjqliu Actually, I may even be able to let go of this parameter in delta-rs if I just always convert to the lower types for the schema check :p
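A sketch of that "convert to lower" idea (my illustration, not the delta-rs change itself): map large arrow types to their small counterparts before comparing schemas.

```python
import pyarrow as pa

def lower_type(dt: pa.DataType) -> pa.DataType:
    # Map large arrow types to their small counterparts so the
    # comparison ignores the large/small distinction.
    if pa.types.is_large_string(dt):
        return pa.string()
    if pa.types.is_large_binary(dt):
        return pa.binary()
    if pa.types.is_large_list(dt):
        return pa.list_(lower_type(dt.value_type))
    return dt

def lowered(schema: pa.Schema) -> pa.Schema:
    return pa.schema([pa.field(f.name, lower_type(f.type), f.nullable) for f in schema])

assert lowered(pa.schema([("this", pa.large_string())])) == \
       lowered(pa.schema([("this", pa.string())]))
```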
There seem to be two issues here. Looks like in PyIceberg, we're casting between the two string types during schema conversion. I have no idea why Polars defaults to `large_string`.
Polars isn't changing from `string` to `large_string` when it converts to arrow. It doesn't use the small `string` type at all; `large_string` is its native arrow representation.
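This is easy to observe (a quick sketch):

```python
import polars as pl

df = pl.DataFrame({"this": ["a", "b"]})
print(df.to_arrow().schema)
# this: large_string  <- the String column is exported as large_string
```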
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
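A minimal sketch of a reproduction (the original snippet was not preserved; the local path, column name, and the assumption that the failure surfaces on a second write are mine):

```python
import polars as pl

df = pl.DataFrame({"this": ["a", "b", "c"]})

df.write_delta("/tmp/repro_table")                 # creates the table
df.write_delta("/tmp/repro_table", mode="append")  # a later write raises DeltaError
```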
Outputs a `DeltaError`. In this case, if you look at the delta table, it has two parquet files. In the first parquet file the `this` field is of type `large_string`, whereas in the second the `this` field is of type `string`.

Issue description
There is an invalid schema generated when creating a new delta table. This has to do with delta lake not distinguishing between the arrow datatypes `Utf8` and `LargeUtf8`. I believe this is caused by lines 3307-3314 of frame.py; see pull request #7616. There, it relies on an existing table to fix the schema to be consistent with a delta table schema. To remedy this, we can cast the existing `data.schema` object to a deltalake schema object and back; I think if we replace the code in frame.py referenced above with a round-trip cast of that kind (see the sketch below), then the problem will be resolved for any table that is created.
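A minimal sketch of that round-trip (assuming the `deltalake` Python package's `Schema.from_pyarrow`/`to_pyarrow` pair; the exact code replaced in frame.py is an assumption, since the original snippet was not preserved):

```python
import pyarrow as pa
from deltalake import Schema

def delta_compatible_schema(schema: pa.Schema) -> pa.Schema:
    # Round-trip through the deltalake Schema type: arrow types with no
    # distinct Delta equivalent (e.g. large_string) come back normalized.
    return Schema.from_pyarrow(schema).to_pyarrow()

# e.g. before writing a brand-new table:
# data = data.cast(delta_compatible_schema(data.schema))
```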
Expected behavior
New delta table created with valid deltalake schema.
Installed versions