Schema comparison in writer #1853
Comments
I generally think of schemas as being ordered, so I don't consider this a bug. But if you would like |
@wjones127 |
I ran into this as well. For our use case the order doesn't matter as long as the schema matches. But now when writing I'm getting this error too, because the internal data is ordered slightly differently. I think this could be supported by comparing the schemas as sets when some argument is supplied, i.e. this check:

    if schema != table.schema().to_pyarrow(
        as_large_types=large_dtypes
    ) and not (mode == "overwrite" and overwrite_schema):
        raise ValueError(
            "Schema of data does not match table schema\n"
            f"Data schema:\n{schema}\nTable Schema:\n{table.schema().to_pyarrow(as_large_types=large_dtypes)}"
        )

could be changed to:

    if set(schema) != set(table.schema().to_pyarrow(
        as_large_types=large_dtypes
    )) and not (mode == "overwrite" and overwrite_schema):
        raise ValueError(
            "Schema of data does not match table schema\n"
            f"Data schema:\n{schema}\nTable Schema:\n{table.schema().to_pyarrow(as_large_types=large_dtypes)}"
        )
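To make the trade-off in the set-based suggestion concrete, here is a minimal sketch that does not use deltalake or pyarrow at all. It assumes a schema can be modeled as a list of (name, type name, nullable) tuples; the helper `schemas_match` and its `ignore_order` flag are hypothetical names for illustration, not part of the delta-rs API. It uses a multiset rather than a plain set so that a duplicated field is not silently collapsed.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical stand-in for a schema field: (name, type name, nullable).
Field = Tuple[str, str, bool]


def schemas_match(
    data: Iterable[Field], table: Iterable[Field], ignore_order: bool = False
) -> bool:
    """Compare two schemas, optionally ignoring the order of their fields."""
    data_fields, table_fields = list(data), list(table)
    if ignore_order:
        # Counter keeps multiplicity, so duplicated fields still have to match.
        return Counter(data_fields) == Counter(table_fields)
    # Default: strict, order-sensitive comparison.
    return data_fields == table_fields


table_schema = [("id", "int64", False), ("name", "string", True)]
data_schema = [("name", "string", True), ("id", "int64", False)]

strict_result = schemas_match(data_schema, table_schema)
relaxed_result = schemas_match(data_schema, table_schema, ignore_order=True)
```

Keeping the strict comparison as the default means existing callers see no change in behavior; only callers who opt in get the relaxed check.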
# Description
Supersedes this PR: #1854. @PierreDubrulle, thanks for pointing it out.
# Related Issue(s)
- closes #1853
Environment
Delta-rs version: 0.12.0
Environment:
Bug
The comparison between the table schema and the schema passed as an argument to the write_deltalake function is too strict.
If the supplied schema contains the same columns and types as the table schema, but in a different order, then the exception ValueError: Schema of data does not match table schema is raised.
Reproducible example
Solution
The two schemas (the one in the table and the one provided for writing) should not be compared naively. It would be better to sort the fields of each schema and hash the result, so that the comparison does not depend on column order.
It would also be useful to expose these hashes via a property of the DeltaTable class.
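The sort-then-hash idea could look roughly like the sketch below. It is not taken from delta-rs: fields are again modeled as plain (name, type name, nullable) tuples, and `order_insensitive_schema_hash`, `TableStub`, and its `schema_hash` property are hypothetical names standing in for whatever the real DeltaTable class might expose.

```python
import hashlib
from typing import Iterable, List, Tuple

# Hypothetical stand-in for a schema field: (name, type name, nullable).
Field = Tuple[str, str, bool]


def order_insensitive_schema_hash(fields: Iterable[Field]) -> str:
    """Hash a schema so that the result does not depend on field order."""
    digest = hashlib.sha256()
    # Sorting first gives every permutation of the same fields the same byte stream.
    for name, type_name, nullable in sorted(fields):
        digest.update(f"{name}\x00{type_name}\x00{int(nullable)}\n".encode("utf-8"))
    return digest.hexdigest()


class TableStub:
    """Hypothetical stand-in for a table object that exposes its schema hash."""

    def __init__(self, fields: Iterable[Field]) -> None:
        self._fields: List[Field] = list(fields)

    @property
    def schema_hash(self) -> str:
        return order_insensitive_schema_hash(self._fields)


table = TableStub([("id", "int64", False), ("name", "string", True)])
incoming = [("name", "string", True), ("id", "int64", False)]
same_schema = table.schema_hash == order_insensitive_schema_hash(incoming)
```

A writer could then compare the two digests instead of the schema objects, and raise the existing ValueError only when they differ.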