
Consider supporting schema evolution #1667

Closed

sugibuchi opened this issue Sep 26, 2023 · 1 comment
Labels
binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate), enhancement (New feature or request)

Comments


sugibuchi commented Sep 26, 2023

Description

With the current version of the write_deltalake function, we must set mode="overwrite" whenever we use overwrite_schema=True.

This constraint is explicitly implemented in the source code of this function.

https://github.com/delta-io/delta-rs/blob/main/python/deltalake/writer.py#L181

However, this constraint makes it difficult to use delta-rs for use cases that need to handle schema evolution because write_deltalake will entirely delete existing data if mode="overwrite" is set. We cannot "append" data to existing Delta Lake tables with new columns.

On the other hand, the Spark API of Delta Lake supports mergeSchema in both overwrite and append modes.

https://docs.delta.io/latest/delta-batch.html#automatic-schema-update

Delta Lake can automatically update the schema of a table as part of a DML transaction (either appending or overwriting), and make the schema compatible with the data being written.

It would be beneficial to have an option for handling schema evolution.

  • Add a new write option equivalent to mergeSchema in the Spark API, or
  • Relax the constraint on overwrite_schema to allow overwrite_schema=True with mode="append".

Use Case

We operate an hourly batch for writing Web access event logs into a Delta Lake table. It generally works. However, we sometimes need to add data fields to the event logs.

For now, we cannot handle such schema evolution with delta-rs. We need to use Spark instead to update the schema of the Delta Lake table.

Related Issue(s)

This is not an issue, but we recently found a workaround for appending data to an existing Delta Lake table with new columns.

from deltalake import write_deltalake

# Schema update: write an empty PyArrow table with overwrite_schema=True, mode="overwrite",
# and a partition_filters predicate that matches no partitions, so no data is deleted
write_deltalake(
    "event_log",
    py_table.schema.empty_table(),  # empty table carrying the new schema
    mode="overwrite",
    overwrite_schema=True,
    partition_by=["event_date"],
    partition_filters=[("event_date", "=", "fake_date")],
)

# Then we can safely append data
write_deltalake(
    "event_log",
    py_table,
    partition_by=["event_date"],
    mode="append",
)

This workaround works, but (1) the Delta Lake table must be partitioned, and (2) honestly, it is not straightforward.

@sugibuchi sugibuchi added the enhancement label Sep 26, 2023
@ion-elgreco ion-elgreco added the binding/python and binding/rust labels Nov 22, 2023
@ion-elgreco
Collaborator

Closing this, since it's a duplicate of #1386.

@ion-elgreco ion-elgreco closed this as not planned (duplicate) Nov 22, 2023