Merge update+insert truncates a delta table #2320
Comments
@t1g0rz thanks for the issue. This seems to have been introduced in the latest version; 0.16.1 was still working fine.
@ion-elgreco Here is my updated code to catch this bug:
import numpy as np
import pandas as pd
from deltalake import DeltaTable
import pyarrow as pa
import polars as pl
import string
def create_mock_df(st_idx, end_idx, sets_of_data):
    # Build `sets_of_data` random frames over the same id range,
    # then keep one row per (iii, name) pair.
    diff = end_idx - st_idx
    res = []
    for i in range(sets_of_data):
        mock_df = pd.DataFrame(np.random.random((diff, 5)), columns=[f"c{i}" for i in range(1, 6)], dtype=str)
        mock_df.insert(0, 'iii', range(st_idx, end_idx))
        mock_df.insert(1, 'name', np.random.choice(list(string.ascii_uppercase), size=diff))
        res.append(mock_df)
    return pd.concat(res, ignore_index=True).drop_duplicates(['iii', 'name'])
settings_to_merge = [
    (0, 1_400_000, 10),
    (1_040_000, 1_045_000, 10),
    (1_450_000, 1_500_000, 10),
    (1_139_800, 1_600_000, 10),
]
path = 'test'
storage_options=None
DeltaTable.create(
    path,
    storage_options=storage_options,
    schema=pa.schema(
        [
            pa.field('iii', type=pa.int64(), nullable=False),
            pa.field('name', type=pa.string(), nullable=False),
            pa.field('c1', type=pa.string()),
            pa.field('c2', type=pa.string()),
            pa.field('c3', type=pa.string()),
            pa.field('c4', type=pa.string()),
            pa.field('c5', type=pa.string()),
        ]
    ),
)
for st_idx, end_idx, sets_of_data in settings_to_merge:
    mock_df = create_mock_df(st_idx, end_idx, sets_of_data)
    dt = DeltaTable(path, storage_options=storage_options)
    # Upsert: the `t.iii > min` term is meant to limit the scan of the target table.
    es = (
        dt.merge(mock_df, predicate=f't.iii > {mock_df.iii.min()} and s.iii = t.iii and s.name = t.name', source_alias='s', target_alias='t')
        .when_not_matched_insert_all()
        .when_matched_update_all()
        .execute()
    )
    print(es)
    print('init df shape:', len(mock_df))
    print('delta shape after merge:', pl.scan_delta(path, storage_options=storage_options).select('name').collect().shape)
    print('----')
Here is the output:
UPD: I realized that I cannot reopen the issue, and I'm not sure anyone will notice my update, so I created a new one: #2362
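One way to flag the truncation automatically in the loop above is to compare row counts against the metrics printed as `es`. This is only a rough sketch: `check_upsert_metrics` is a hypothetical helper, and the key names `num_target_rows_inserted` and `num_target_rows_deleted` are assumed to match what `execute()` returns in your version, so check them against the printed dictionary first.

```python
def check_upsert_metrics(rows_before, rows_after, metrics):
    """Return a list of problems found after an insert/update-only merge.

    `metrics` is assumed to be the dict returned by `.execute()`; the key
    names below are an assumption and may differ between versions.
    """
    problems = []
    deleted = metrics.get("num_target_rows_deleted", 0)
    inserted = metrics.get("num_target_rows_inserted", 0)
    # A merge with only insert and update clauses should never delete rows.
    if deleted:
        problems.append(f"{deleted} target rows reported as deleted")
    # Updates keep the row count unchanged, so only inserts may grow the table.
    if rows_after != rows_before + inserted:
        problems.append(f"expected {rows_before + inserted} rows, found {rows_after}")
    return problems
```

Here `rows_before` and `rows_after` would be the table's row counts measured just before and just after each `execute()` call. An empty list means the counts are consistent with a pure upsert.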
Environment
Delta-rs version: 0.16.2
Binding: python
Environment:
Bug
What happened:
I attempted to limit the scan of the target table by adding a filter on the target to the merge predicate, and noticed that when I do so, the merge simply removes the target rows for which the predicate evaluates to false.
What you expected to happen:
I expected updates and inserts to occur according to the predicate.
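To make the expectation concrete, here is a small pure-Python reference model of what I would expect an update+insert merge to do with a predicate of the form `t.iii > X and s.iii = t.iii and s.name = t.name`. It is a sketch of the expected semantics only, not how delta-rs implements the merge; `expected_upsert` and `min_iii` are hypothetical names, and source rows are assumed to be unique per `(iii, name)`.

```python
def expected_upsert(target, source, min_iii):
    """Model an upsert; rows are dicts with at least 'iii' and 'name' keys."""
    # Every target row survives: nothing in this merge deletes rows.
    result = [dict(row) for row in target]
    # Only target rows passing the filter `t.iii > min_iii` can be matched.
    index = {
        (row["iii"], row["name"]): pos
        for pos, row in enumerate(result)
        if row["iii"] > min_iii
    }
    for s in source:
        key = (s["iii"], s["name"])
        if key in index:
            result[index[key]] = dict(s)  # when matched: update all columns
        else:
            result.append(dict(s))        # when not matched: insert
    return result


target = [
    {"iii": 1, "name": "A", "c1": "old"},
    {"iii": 5, "name": "B", "c1": "old"},
]
source = [
    {"iii": 5, "name": "B", "c1": "new"},
    {"iii": 6, "name": "C", "c1": "new"},
]
merged = expected_upsert(target, source, min_iii=4)
```

In this model the row with `iii = 1` fails the filter, so it is neither updated nor removed. The filter should only decide which target rows are candidates for a match, and the table should never shrink.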
How to reproduce it:
Below is the code which could help to understand the issue. I'm not entirely sure if it is indeed an issue or if I am simply doing something wrong: