
fix: remove unnecessary metadata action when overwriting partition #2923

Draft
wants to merge 1 commit into main
Conversation

PeterKeDer (Contributor)

Description

Fixes a spurious metadata action emitted when write_deltalake is called with mode='overwrite', a predicate, and a string partition column. This is undesirable because the metadata action causes all concurrent writes to fail and need to be retried.

The root cause is the schema != table_schema check: table_schema comes from table.input_schema(), which converts string partition columns to dictionary-encoded columns, so it differs from schema even when nothing has actually changed.

We fix this by comparing the schemas with try_cast_batch instead, so the behavior becomes identical to writing with mode='append'.
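
A minimal sketch of the mismatch, using arrow-rs directly (the UInt16 dictionary key width is illustrative, not necessarily what input_schema produces):

use arrow::compute::can_cast_types;
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Schema of the incoming data: a plain string partition column.
    let written = Schema::new(vec![Field::new("id", DataType::Utf8, true)]);

    // Schema reported by the table: the partition column comes back dictionary-encoded.
    let table = Schema::new(vec![Field::new(
        "id",
        DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
        true,
    )]);

    // The old strict equality check sees two different schemas, so every
    // overwrite emits a metadata action.
    assert_ne!(written, table);

    // The two column types are castable to each other, which is the property
    // the try_cast_batch-based check relies on instead.
    assert!(can_cast_types(
        written.field(0).data_type(),
        table.field(0).data_type()
    ));
}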

To replicate (on deltalake==0.20.1):

from deltalake import DeltaTable, write_deltalake
import polars as pl

df1 = pl.DataFrame({'id': ['a', 'b'], 'val': [1, 2]})

# The first call creates the table; the next two overwrite it with the exact same schema.
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])

If we look at the latter two transaction JSONs in _delta_log, each contains a metadata action indicating a schema change, even though the schema is identical.
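
One way to confirm this (a rough sketch, assuming the testtable1 path from the snippet above and the standard zero-padded commit file naming):

use std::fs;

fn main() -> std::io::Result<()> {
    // Versions 1 and 2 are the second and third writes from the reproduction;
    // neither should need a metaData action.
    for version in 1..=2u64 {
        let path = format!("testtable1/_delta_log/{version:020}.json");
        let commit = fs::read_to_string(&path)?;
        // Each line of a commit file is one JSON action.
        let has_metadata = commit.lines().any(|line| line.contains("\"metaData\""));
        println!("{path}: metaData action present = {has_metadata}");
    }
    Ok(())
}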

github-actions bot added the binding/rust label on Oct 5, 2024
@houqp (Member) left a comment:

Would be good to add a test as a follow-up.

@houqp enabled auto-merge on October 5, 2024 04:35
@@ -1075,7 +1075,7 @@ impl std::future::IntoFuture for WriteBuilder {
             actions.push(protocol.into())
         }

-        if schema != table_schema {
+        if try_cast_batch(schema.fields(), table_schema.fields()).is_err() {
A Collaborator left a comment:

try_cast_batch is too lenient; you can already see that in the test failure.

A Member left a comment:

There are a couple of issues worth digging into:

  • can_cast_types seems like the wrong function to use in try_cast_batch, considering it allows incompatible type downcasting, e.g. casting from i64 to i8 (see the sketch after this list).
  • The failing Rust test test_issue_2105 has the id column type defined as PrimitiveType::Integer, which per the Delta spec should map to a 4-byte int (i32). But the test expects the query result to have the type ArrowDataType::Int64. Is this expected? The original PR that introduced this test had the expected type defined as ArrowDataType::Int32, which seems more reasonable, but it got changed to i64 in a later schema-casting PR.
  • The failing Python test test_parse_stats_with_new_schema is definitely a valid test failure that we should fix, @PeterKeDer, although it might be related to the same use of can_cast_types in try_cast_batch mentioned in my first point.
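
On the first point, a minimal sketch of the leniency in arrow-rs (can_cast_types only reports whether a cast kernel exists, not whether the cast is lossless):

use arrow::compute::can_cast_types;
use arrow::datatypes::DataType;

fn main() {
    // A lossy i64 -> i8 downcast is reported as castable; overflow is only
    // dealt with when the cast actually runs.
    assert!(can_cast_types(&DataType::Int64, &DataType::Int8));
}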

@rtyler marked this pull request as draft on October 8, 2024 14:11