Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: made generalize_filter less permissive, also added more cases #2149

Merged
merged 3 commits into from
Feb 1, 2024

Conversation

emcake
Copy link
Contributor

@emcake emcake commented Jan 30, 2024

Description

This fixes an observed bug where the partition generalization was failing. A minimal repro was:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa
import pandas as pd

data = pd.DataFrame.from_dict(
    {
        "a": [],
        "b": [],
        "c": [],
    }
)
schema = pa.schema(
    [
        ("a", pa.string()),
        ("b", pa.int32()),
        ("c", pa.int32()),
    ]
)

table = pa.Table.from_pandas(data, schema=schema)


write_deltalake(
    "test",
    table,
    mode="overwrite",
    partition_by="a"
)

new_data = pd.DataFrame.from_dict(
    {
        "a": ["a", "a", "a"],
        "b": [None, 2, 4],
        "c": [5, 6, 7],
    }
)
new_table = pa.Table.from_pandas(new_data, schema)

dt = DeltaTable("test")
dt.merge(
    source=new_table,
    predicate="s.b IS NULL",
    source_alias="s",
    target_alias="t",

).when_matched_update_all().when_not_matched_insert_all().execute()

This would cause a DataFusion error:

_internal.DeltaError: Generic DeltaTable error: Optimizer rule 'simplify_expressions' failed
caused by
Schema error: No field named s.b. Valid fields are t.a, t.b, t.c, t.__delta_rs_path.

This was because when generalizing the match predicate to use as a partition filter, an expression IsNull(Column('b', 's')) was deemed to not reference the table s.

This PR does two things:

  1. Tightens up the referencing logic. Previous it conflated 'definitely does not reference' with 'don't know if it references or not'. This tightens up the logic and means the plan is less likely to generalize out a partition filter if we can't be sure that it can be generalized. The ability to generalize over arbitrary binary expressions has been tightened too - previously behaviour would permit that generalizable_expression OR non_generalizable_expression would reduce to generalizable_expression. This isn't correct in the case of partition filters, because this would cause us to leave out half the cases that should be extracted from the target table.

  2. Adds a couple of extra cases where we know if a target reference exists. Namely, is null can now be checked for source table references and literal is now re-covered, as previously it was working by taking advantage of looser logic that has since been tightened.

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jan 30, 2024
Copy link
Collaborator

@Blajda Blajda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for contributing this fix!

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@roeap roeap merged commit 3ec28cc into delta-io:main Feb 1, 2024
20 checks passed
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this pull request Feb 2, 2024
…elta-io#2149)

# Description
This fixes an observed bug where the partition generalization was
failing. A minimal repro was:

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa
import pandas as pd

data = pd.DataFrame.from_dict(
    {
        "a": [],
        "b": [],
        "c": [],
    }
)
schema = pa.schema(
    [
        ("a", pa.string()),
        ("b", pa.int32()),
        ("c", pa.int32()),
    ]
)

table = pa.Table.from_pandas(data, schema=schema)


write_deltalake(
    "test",
    table,
    mode="overwrite",
    partition_by="a"
)

new_data = pd.DataFrame.from_dict(
    {
        "a": ["a", "a", "a"],
        "b": [None, 2, 4],
        "c": [5, 6, 7],
    }
)
new_table = pa.Table.from_pandas(new_data, schema)

dt = DeltaTable("test")
dt.merge(
    source=new_table,
    predicate="s.b IS NULL",
    source_alias="s",
    target_alias="t",

).when_matched_update_all().when_not_matched_insert_all().execute()
```

This would cause a DataFusion error:

```
_internal.DeltaError: Generic DeltaTable error: Optimizer rule 'simplify_expressions' failed
caused by
Schema error: No field named s.b. Valid fields are t.a, t.b, t.c, t.__delta_rs_path.
```

This was because when generalizing the match predicate to use as a
partition filter, an expression `IsNull(Column('b', 's'))` was deemed to
not reference the table `s`.

This PR does two things:

1. **Tightens up the referencing logic.** Previous it conflated
'definitely does not reference' with 'don't know if it references or
not'. This tightens up the logic and means the plan is less likely to
generalize out a partition filter if we can't be sure that it can be
generalized. The ability to generalize over arbitrary binary expressions
has been tightened too - previously behaviour would permit that
`generalizable_expression OR non_generalizable_expression` would reduce
to `generalizable_expression`. This isn't correct in the case of
partition filters, because this would cause us to leave out half the
cases that should be extracted from the target table.

2. **Adds a couple of extra cases where we know if a target reference
exists.** Namely, `is null` can now be checked for source table
references and `literal` is now re-covered, as previously it was working
by taking advantage of looser logic that has since been tightened.

Co-authored-by: David Blajda <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants