
Consider supporting schema evolution #1667

Closed

sugibuchi opened this issue Sep 26, 2023 · 1 comment
Labels
binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate), enhancement (New feature or request)

Comments


sugibuchi commented Sep 26, 2023

Description

With the current version of the write_deltalake function, we must set mode="overwrite" whenever we use overwrite_schema=True.

This constraint is explicitly implemented in the source code of this function.

https://github.com/delta-io/delta-rs/blob/main/python/deltalake/writer.py#L181

However, this constraint makes it difficult to use delta-rs for use cases that need to handle schema evolution because write_deltalake will entirely delete existing data if mode="overwrite" is set. We cannot "append" data to existing Delta Lake tables with new columns.

On the other hand, the Spark API of Delta Lake supports mergeSchema in both overwrite and append modes.

https://docs.delta.io/latest/delta-batch.html#automatic-schema-update

Delta Lake can automatically update the schema of a table as part of a DML transaction (either appending or overwriting), and make the schema compatible with the data being written.

It would be beneficial to have an option for handling schema evolution.

  • Add a new write option equivalent to mergeSchema in the Spark API, or
  • Relax the constraint on overwrite_schema to allow overwrite_schema=True with mode="append".

Use Case

We operate an hourly batch for writing Web access event logs into a Delta Lake table. It generally works. However, we sometimes need to add data fields to the event logs.

For now, we cannot handle such schema evolution with delta-rs. We need to use Spark instead to update the schema of the Delta Lake table.

Related Issue(s)

This is not an issue, but we recently found a workaround for appending data to an existing Delta Lake table with new columns.

from deltalake import write_deltalake

# Schema update: write an empty PyArrow table with overwrite_schema=True, mode="overwrite",
# and a partition_filters predicate that matches no partitions, so no data is deleted
write_deltalake(
    "event_log",
    py_table.schema.empty_table(),  # empty table carrying the new schema
    mode="overwrite",
    overwrite_schema=True,
    partition_by=["event_date"],
    partition_filters=[("event_date", "=", "fake_date")],
)

# Then we can safely append data
write_deltalake(
    "event_log",
    py_table,
    partition_by=["event_date"],
    mode="append",
)

This workaround works, but (1) the Delta Lake table must be partitioned, and (2) honestly, it is not straightforward.

@sugibuchi sugibuchi added the enhancement label Sep 26, 2023
@ion-elgreco ion-elgreco added the binding/python and binding/rust labels Nov 22, 2023
@ion-elgreco
Collaborator

Closing this, since it's a duplicate of #1386.

@ion-elgreco ion-elgreco closed this as not planned (duplicate) Nov 22, 2023