fix: made generalize_filter less permissive, also added more cases #2149

emcake · 2024-01-30T14:25:02Z

Description

This fixes an observed bug where the partition generalization was failing. A minimal repro was:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa
import pandas as pd

data = pd.DataFrame.from_dict(
    {
        "a": [],
        "b": [],
        "c": [],
    }
)
schema = pa.schema(
    [
        ("a", pa.string()),
        ("b", pa.int32()),
        ("c", pa.int32()),
    ]
)

table = pa.Table.from_pandas(data, schema=schema)


write_deltalake(
    "test",
    table,
    mode="overwrite",
    partition_by="a"
)

new_data = pd.DataFrame.from_dict(
    {
        "a": ["a", "a", "a"],
        "b": [None, 2, 4],
        "c": [5, 6, 7],
    }
)
new_table = pa.Table.from_pandas(new_data, schema)

dt = DeltaTable("test")
dt.merge(
    source=new_table,
    predicate="s.b IS NULL",
    source_alias="s",
    target_alias="t",
).when_matched_update_all().when_not_matched_insert_all().execute()

This would cause a DataFusion error:

_internal.DeltaError: Generic DeltaTable error: Optimizer rule 'simplify_expressions' failed
caused by
Schema error: No field named s.b. Valid fields are t.a, t.b, t.c, t.__delta_rs_path.

This was because when generalizing the match predicate to use as a partition filter, an expression IsNull(Column('b', 's')) was deemed to not reference the table s.
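To make the failure mode concrete, here is a minimal Python sketch of a reference check that has no rule for IS NULL. It is a toy model only: the real logic lives in the Rust crate and works on DataFusion expressions, and the class and function names below (Column, IsNull, Binary, references_table_loose) are hypothetical stand-ins, not the delta-rs API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    relation: str  # table alias, e.g. "s" or "t"


@dataclass(frozen=True)
class IsNull:
    expr: object


@dataclass(frozen=True)
class Binary:
    left: object
    op: str
    right: object


def references_table_loose(expr, table: str) -> bool:
    """Loose check: an expression kind without a rule counts as 'no reference'."""
    if isinstance(expr, Column):
        return expr.relation == table
    if isinstance(expr, Binary):
        return (references_table_loose(expr.left, table)
                or references_table_loose(expr.right, table))
    # No rule for IsNull, so "don't know" silently becomes "does not reference".
    return False


predicate = IsNull(Column("b", "s"))
# The check wrongly reports that `s.b IS NULL` does not touch table "s",
# so the predicate would be kept as a filter on the target table.
print(references_table_loose(predicate, "s"))  # False
```

Under this model, the source-side predicate is handed to the target scan unchanged, which matches the "No field named s.b" schema error shown above.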

This PR does two things:

1. Tightens up the referencing logic. Previously it conflated 'definitely does not reference' with 'don't know if it references or not'. The check now keeps those two outcomes apart, so the plan is less likely to generalize out a partition filter when we can't be sure that it can be generalized. Generalization over arbitrary binary expressions has been tightened too: previously, generalizable_expression OR non_generalizable_expression would reduce to generalizable_expression. This isn't correct for partition filters, because it would leave out the cases matched only by the other branch of the OR, which should also be extracted from the target table.
2. Adds a couple of extra cases where we know if a target reference exists. Namely, is null can now be checked for source table references, and literal is covered again; it previously worked only by taking advantage of the looser logic that has since been tightened.
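The two points above can be sketched with a toy expression model in Python. This is an illustration under stated assumptions, not the Rust implementation: all names (references_table, generalize, Unknown, and so on) are hypothetical, the three-valued result is modelled as True / False / None, and the rule that an AND may keep just one generalizable side is an assumption of this sketch (it is logically safe because it only widens the filter) rather than something the description states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    relation: Optional[str] = None  # None models an unqualified column


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class IsNull:
    expr: object


@dataclass(frozen=True)
class Binary:
    left: object
    op: str  # "AND", "OR", "=", ...
    right: object


@dataclass(frozen=True)
class Unknown:
    label: str  # any expression kind the checker has no rule for


def references_table(expr, table: str) -> Optional[bool]:
    """True or False when certain, None when we cannot tell."""
    if isinstance(expr, Column):
        return None if expr.relation is None else expr.relation == table
    if isinstance(expr, Literal):
        return False
    if isinstance(expr, IsNull):
        return references_table(expr.expr, table)
    if isinstance(expr, Binary):
        left = references_table(expr.left, table)
        right = references_table(expr.right, table)
        if left is True or right is True:
            return True
        return None if left is None or right is None else False
    return None  # unknown kind: refuse to guess


def generalize(expr, source: str):
    """Return a filter that is safe on the target alone, or None."""
    if references_table(expr, source) is False:
        return expr  # provably free of source references
    if isinstance(expr, Binary) and expr.op in ("AND", "OR"):
        left = generalize(expr.left, source)
        right = generalize(expr.right, source)
        if left is not None and right is not None:
            return Binary(left, expr.op, right)
        if expr.op == "AND":
            return left if left is not None else right  # only widens the filter
    return None  # OR with a lost branch, or anything uncertain


target_only = Binary(Column("a", "t"), "=", Literal("x"))
source_only = IsNull(Column("b", "s"))
print(generalize(Binary(target_only, "OR", source_only), "s"))   # None
print(generalize(Binary(target_only, "AND", source_only), "s"))  # target_only
```

The OR case returns None because keeping only the target-side branch would skip rows that match the source-side branch, while the AND case can keep the target-side branch as a coarser but still correct filter.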

Blajda

LGTM thanks for contributing this fix!

roeap

LGTM 👍

Co-authored-by: David Blajda

fix: made generalize_filter less permissive, also added more cases

5a17254

emcake requested review from wjones127, roeap and rtyler as code owners January 30, 2024 14:25

github-actions bot added the binding/rust Issues for the Rust crate label Jan 30, 2024

emcake and others added 2 commits January 30, 2024 22:51

Merge branch 'main' into fix-merge-generalize-filter

b725f6a

Merge branch 'main' into fix-merge-generalize-filter

1791e7f

Blajda approved these changes Feb 1, 2024


roeap approved these changes Feb 1, 2024


roeap merged commit 3ec28cc into delta-io:main Feb 1, 2024
20 checks passed

This was referenced Feb 2, 2024

Merge on IS NULL condition doesn't work for empty table #2148

Closed

When_matched_update causes records to be lost with explicit predicate #2158

Closed
