Consider supporting schema evolution #1667
Labels: binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate), enhancement (New feature or request)
Description
With the current version of the `write_deltalake` function, we must set `mode="overwrite"` when we use `overwrite_schema=True`. This constraint is explicitly enforced in the source code of this function:
https://github.com/delta-io/delta-rs/blob/main/python/deltalake/writer.py#L181
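A minimal sketch of the constraint, assuming a local table at the placeholder path `./events` and a recent `deltalake` release:

```python
import pyarrow as pa
from deltalake import write_deltalake

# Initial write: a fresh table with two columns.
table_v1 = pa.table({"id": [1, 2], "ts": ["2023-09-25T00:00:00", "2023-09-25T01:00:00"]})
write_deltalake("./events", table_v1)

# A later batch carries an extra column.
table_v2 = pa.table({"id": [3], "ts": ["2023-09-25T02:00:00"], "new_field": ["x"]})

# Appending with a changed schema is rejected by the check linked above.
try:
    write_deltalake("./events", table_v2, mode="append")
except ValueError as err:
    print(err)  # schema-mismatch error

# The only accepted combination replaces all existing rows.
write_deltalake("./events", table_v2, mode="overwrite", overwrite_schema=True)
```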
However, this constraint makes it difficult to use delta-rs for use cases that need to handle schema evolution, because `write_deltalake` will entirely delete the existing data when `mode="overwrite"` is set. We cannot append data to an existing Delta Lake table with new columns.

On the other hand, the Spark API of Delta Lake supports `mergeSchema` in both the overwrite mode and the append mode:
https://docs.delta.io/latest/delta-batch.html#automatic-schema-update
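For reference, a sketch of the Spark usage described on the linked page; `df_with_new_column` and the path are placeholders:

```python
# PySpark: mergeSchema enables schema evolution even in append mode.
(
    df_with_new_column.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/tmp/events")
)
```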
It would be beneficial to have an option for handling schema evolution: something like `mergeSchema` in the Spark API, or allowing `overwrite_schema=True` to be used together with `mode="append"`.
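A sketch of what such an option could look like, continuing the snippet above; the `merge_schema` flag is hypothetical and does not exist in delta-rs today:

```python
# Hypothetical API sketch: merge_schema is NOT a real parameter.
write_deltalake(
    "./events",
    table_v2,
    mode="append",
    merge_schema=True,  # would add new_field to the table schema, then append
)
```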
Use Case
We operate an hourly batch job that writes Web access event logs into a Delta Lake table. It generally works well. However, we sometimes need to add new data fields to the event logs.
For now, we cannot handle such schema evolution with delta-rs, so we have to fall back to Spark to update the schema of the Delta Lake table.
Related Issue(s)
This is not an issue per se, but we recently found a workaround for appending data to an existing Delta Lake table with new columns.
The workaround works, but (1) the Delta Lake table must be partitioned, and (2) it is, honestly, not straightforward.
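The workaround itself is not reproduced here, but as an illustration of the kind of trick the two caveats suggest (our assumption, not necessarily the original approach), one could overwrite a single partition with `overwrite_schema=True` so the other partitions are left untouched; this assumes a `deltalake` version whose `write_deltalake` supports `partition_filters`:

```python
import pyarrow as pa
from deltalake import write_deltalake

# Assumed workaround sketch, not the original author's code: overwrite
# only one partition while evolving the schema; the remaining partitions
# keep their data. Requires the table to be partitioned (caveat 1).
batch = pa.table({
    "date": ["2023-09-25"],
    "id": [3],
    "new_field": ["x"],  # the newly added column
})

write_deltalake(
    "./events",
    batch,
    mode="overwrite",
    overwrite_schema=True,
    partition_by=["date"],
    partition_filters=[("date", "=", "2023-09-25")],  # illustrative value
)
```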