
fix: remove unnecessary metadata action when overwriting partition #2923

Draft
wants to merge 1 commit into main
Conversation

PeterKeDer (Contributor)

Description

Fixes a spurious metadata action emitted when write_deltalake is called with mode='overwrite', a predicate, and a string partition column. This is undesirable because the metadata action causes all concurrent writes to fail and need to be retried.

The root cause is the schema != table_schema check: table_schema comes from table.input_schema(), which converts string partition columns to dictionary-encoded columns, so it differs from schema even when nothing has actually changed.

We fix this by comparing the schemas with try_cast_batch instead, so the behavior becomes identical to writing with mode='append'.
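
A minimal sketch of the mismatch, using arrow-rs directly (the UInt16 dictionary key width is illustrative, not necessarily what input_schema produces):

use arrow::compute::can_cast_types;
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Schema of the incoming data: a plain string partition column.
    let written = Schema::new(vec![Field::new("id", DataType::Utf8, true)]);

    // Schema reported by the table: the partition column comes back dictionary-encoded.
    let table = Schema::new(vec![Field::new(
        "id",
        DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
        true,
    )]);

    // The old strict equality check sees two different schemas, so every
    // overwrite emits a metadata action.
    assert_ne!(written, table);

    // The two column types are castable to each other, which is the property
    // the try_cast_batch-based check relies on instead.
    assert!(can_cast_types(
        written.field(0).data_type(),
        table.field(0).data_type()
    ));
}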

To replicate (on deltalake==0.20.1):

from deltalake import DeltaTable, write_deltalake
import polars as pl

df1 = pl.DataFrame({'id': ['a', 'b'], 'val': [1, 2]})

# The first call creates the table; the next two overwrite it with the exact same schema.
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])
write_deltalake('testtable1', df1.to_arrow(), schema_mode='merge', mode='overwrite', partition_by=['id'])

If we look at the latter two transaction JSONs in _delta_log, each contains a metadata action indicating a schema change, even though the schema is identical.
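
One way to confirm this (a rough sketch, assuming the testtable1 path from the snippet above and the standard zero-padded commit file naming):

use std::fs;

fn main() -> std::io::Result<()> {
    // Versions 1 and 2 are the second and third writes from the reproduction;
    // neither should need a metaData action.
    for version in 1..=2u64 {
        let path = format!("testtable1/_delta_log/{version:020}.json");
        let commit = fs::read_to_string(&path)?;
        // Each line of a commit file is one JSON action.
        let has_metadata = commit.lines().any(|line| line.contains("\"metaData\""));
        println!("{path}: metaData action present = {has_metadata}");
    }
    Ok(())
}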

github-actions bot added the binding/rust label on Oct 5, 2024
@houqp (Member) left a comment:

Would be good to add a test as a follow-up.

@houqp enabled auto-merge on October 5, 2024 04:35
@@ -1075,7 +1075,7 @@ impl std::future::IntoFuture for WriteBuilder {
             actions.push(protocol.into())
         }

-        if schema != table_schema {
+        if try_cast_batch(schema.fields(), table_schema.fields()).is_err() {
A Collaborator left a comment:

try_cast_batch is too lenient; you can already see that in the test failure.

A Member left a comment:

There are a couple of issues worth digging into:

  • can_cast_types seems like the wrong function to use in try_cast_batch, considering it allows incompatible type downcasting, e.g. casting from i64 to i8 (see the sketch after this list).
  • The failing Rust test test_issue_2105 has the id column type defined as PrimitiveType::Integer, which per the Delta spec should map to a 4-byte int (i32). But the test expects the query result to have the type ArrowDataType::Int64. Is this expected? The original PR that introduced this test had the expected type defined as ArrowDataType::Int32, which seems more reasonable, but it got changed to i64 in a later schema-casting PR.
  • The failing Python test test_parse_stats_with_new_schema is definitely a valid test failure that we should fix, @PeterKeDer, although it might be related to the same use of can_cast_types in try_cast_batch mentioned in my first point.
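
On the first point, a minimal sketch of the leniency in arrow-rs (can_cast_types only reports whether a cast kernel exists, not whether the cast is lossless):

use arrow::compute::can_cast_types;
use arrow::datatypes::DataType;

fn main() {
    // A lossy i64 -> i8 downcast is reported as castable; overflow is only
    // dealt with when the cast actually runs.
    assert!(can_cast_types(&DataType::Int64, &DataType::Int8));
}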

@rtyler marked this pull request as draft on October 8, 2024 14:11