Schema comparison in writer #1853

Closed
PierreDubrulle opened this issue Nov 13, 2023 · 3 comments · Fixed by #2209
Labels
binding/python (Issues for the Python package), bug (Something isn't working)

Comments

@PierreDubrulle
Contributor

Environment

Delta-rs version: 0.12.0

Environment:

  • WSL 2
  • Python 3.10

Bug

The comparison between the table schema and the schema passed to the write_deltalake function is too restrictive.
If the second schema contains the same columns with the same types, but in a different order, the exception ValueError: Schema of data does not match table schema is raised.

Reproducible example

import pandas as pd
import pyarrow as pa
from deltalake import write_deltalake

# Two schemas with the same fields, declared in a different order.
testing_schema1: pa.Schema = pa.schema(
    [
        pa.field("a", pa.int64()),
        pa.field("b", pa.int64()),
    ]
)

testing_schema2: pa.Schema = pa.schema(
    [
        pa.field("b", pa.int64()),
        pa.field("a", pa.int64()),
    ]
)

df = pd.DataFrame({'b': [4, 5, 6, 7], 'a': [1, 2, 3, 5]})

# First write uses the (a, b) ordering.
write_deltalake(table_or_uri="delta.garbage", data=df, mode="append", schema=testing_schema1)
# Second write uses the (b, a) ordering; as reported above, this raises the ValueError.
write_deltalake(table_or_uri="delta.garbage", data=df, mode="append", schema=testing_schema2)

Solution

The two schemas (the one in the table and the one provided for writing) should not be compared naively. It would be better to sort the fields of each schema and hash the result, so that the comparison does not depend on column order.
It would also be useful to expose these hashes via a property of the DeltaTable class.
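
The sort-then-hash idea above can be sketched as follows. This is only an illustration, not deltalake code: the schema is modelled as plain (name, type) pairs, and schema_fingerprint is a hypothetical helper name. With pyarrow, such pairs could presumably be built from each field's name and str(field.type).

```python
import hashlib
import json
from typing import Iterable, Tuple

# Hypothetical stand-in for a schema field: (column name, type name).
Field = Tuple[str, str]


def schema_fingerprint(fields: Iterable[Field]) -> str:
    """Return a hash of the schema that does not depend on column order."""
    # Sorting by (name, type) makes permutations of the same fields identical.
    canonical = json.dumps(sorted(fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


schema1 = [("a", "int64"), ("b", "int64")]
schema2 = [("b", "int64"), ("a", "int64")]  # same fields, different order
schema3 = [("a", "int64"), ("b", "string")]  # different type for "b"

print(schema_fingerprint(schema1) == schema_fingerprint(schema2))
print(schema_fingerprint(schema1) == schema_fingerprint(schema3))
```

The first comparison holds because only the order differs; the second does not, because a column type differs.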

PierreDubrulle added the bug label on Nov 13, 2023
@wjones127
Collaborator

I generally think of schemas as being ordered, so I don't consider this a bug. But if you would like write_deltalake() to re-order input data columns to match the table schema, that seems like a reasonable feature request.
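
A rough sketch of what re-ordering input columns to match the table schema could look like, under the assumption that the data is a simple mapping from column name to values. reorder_columns is a hypothetical helper, not part of deltalake; with a pandas DataFrame the equivalent selection would be df[table_columns].

```python
from typing import Dict, List, Sequence


def reorder_columns(
    data: Dict[str, List[int]], table_columns: Sequence[str]
) -> Dict[str, List[int]]:
    """Return the data with its columns arranged in the table's column order."""
    # Only reorder when both sides have exactly the same column names.
    if set(data) != set(table_columns):
        raise ValueError("Schema of data does not match table schema")
    # Dicts preserve insertion order, so the result follows table_columns.
    return {name: data[name] for name in table_columns}


data = {"b": [4, 5, 6, 7], "a": [1, 2, 3, 5]}
reordered = reorder_columns(data, ["a", "b"])
print(list(reordered))
```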

@PierreDubrulle
Contributor Author

@wjones127
Yes, I want the comparison against the table schema to work without having to provide a schema in the same column order as the table schema.
I'd also like to add a property that returns the hash of the table schema.

rtyler added the binding/python label on Dec 22, 2023
@antonsteenvoorden

antonsteenvoorden commented Jan 30, 2024

I ran into this as well. For our use case the order doesn't matter as long as the schema matches, but writing now fails with this error because the internal data is ordered slightly differently.

I think this could be supported by comparing the schemas as sets when an argument such as ignore_order=True is supplied.

i.e.

            if schema != table.schema().to_pyarrow(
                as_large_types=large_dtypes
            ) and not (mode == "overwrite" and overwrite_schema):
                raise ValueError(
                    "Schema of data does not match table schema\n"
                    f"Data schema:\n{schema}\nTable Schema:\n{table.schema().to_pyarrow(as_large_types=large_dtypes)}"
                )

could be changed to:

            if set(schema) != set(table.schema().to_pyarrow(
                as_large_types=large_dtypes
            )) and not (mode == "overwrite" and overwrite_schema):
                raise ValueError(
                    "Schema of data does not match table schema\n"
                    f"Data schema:\n{schema}\nTable Schema:\n{table.schema().to_pyarrow(as_large_types=large_dtypes)}"
                )
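
As a self-contained illustration of how such an ignore_order flag might behave, here is a small sketch. It does not use the deltalake internals shown above: schemas are modelled as (name, type) pairs, and schemas_match is a hypothetical name. It compares sorted lists rather than sets, so a duplicated field is not silently collapsed.

```python
from typing import Iterable, Tuple

# Hypothetical stand-in for a schema field: (column name, type name).
Field = Tuple[str, str]


def schemas_match(
    data_fields: Iterable[Field],
    table_fields: Iterable[Field],
    ignore_order: bool = False,
) -> bool:
    """Compare two schemas, optionally ignoring the order of their fields."""
    data_list, table_list = list(data_fields), list(table_fields)
    if ignore_order:
        # Sorting both sides removes the dependence on column order.
        return sorted(data_list) == sorted(table_list)
    return data_list == table_list


table_schema = [("a", "int64"), ("b", "int64")]
data_schema = [("b", "int64"), ("a", "int64")]

print(schemas_match(data_schema, table_schema))
print(schemas_match(data_schema, table_schema, ignore_order=True))
```

With the default strict comparison the two schemas differ; with ignore_order=True they are treated as equal.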

ion-elgreco added a commit that referenced this issue Feb 25, 2024
# Description
Supersedes this PR: #1854,
@PierreDubrulle thanks for pointing it out

# Related Issue(s)
- closes #1853