_internal.DeltaError when merging #2084
Comments
@halvorlu do you have concurrent writers to the same table?
There should not be concurrent writers, no.
@halvorlu Is this table partitioned? Can you add additional logging of the schema for the source data?
I added this logging of field names:
And the field names are identical.
Do you know where this error is thrown? I've tried searching in the code, but haven't found out where.
And yes, the table is partitioned on two columns.
@halvorlu can you try making a minimal reproducible example with sample data?
I have the same problem!
@ion-elgreco I've tried to reproduce this reliably, but haven't really been able to...
I'm using merge.
Context:
Any news about this?
@rtyler do you have any thoughts here? Could it be related to something with the logstore implementation?
I hit this issue this week when doing merges from multiple concurrent writers with the DynamoDB locking provider enabled. From the analysis I did, the error seems to originate in the retry recovery after a TransactionError::VersionAlreadyExists error. That function tries to resolve the table merger predicate against the PyArrow schema from a default DataFusion session. This fails with a schema exception because the merge predicate ("target._merge_key = source._merge_key") is only valid against the joint schema of 'source' and 'target' as built in operations/merge build_join_schema. It seems like merge execute needs to build proper predicates for the retry function based on all the inputs and add them to the DeltaOperation::Merge operation so that they are available in commit_with_retries.

Due to the complexity of that merge code, I ended up just adding user-level retry logic around the whole merge.execute() call and will pay the extra expense of writing the data again when this occurs (even if the writes touch separate partitions and should succeed).

Aside: one of my colleagues currently runs into another error, "_internal.DeltaError: Invalid table version: 202", originating from that same retry block when calling write_deltalake with mode="overwrite" from 5 concurrent writers (in a test). I can't recreate it, but I have significantly more latency to our S3 region. I haven't investigated this one as deeply, but it seems like another edge case to be aware of within the retry logic.
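For anyone wanting the same workaround, here is a minimal sketch of that user-level retry, assuming the deltalake Python binding and the predicate mentioned above; the retry count, backoff, and exception handling are choices of this sketch, not something prescribed by the library:

```python
import time

import pyarrow as pa
from deltalake import DeltaTable
from deltalake.exceptions import DeltaError  # assumes this re-export is available in your version


def merge_with_retry(table_uri: str, source: pa.Table, retries: int = 3) -> None:
    """Retry the whole merge when the commit-retry path raises a DeltaError.

    The table URI, key column, and retry policy here are illustrative only.
    """
    for attempt in range(retries):
        try:
            dt = DeltaTable(table_uri)  # re-load so the merge starts from the latest version
            (
                dt.merge(
                    source=source,
                    predicate="target._merge_key = source._merge_key",
                    source_alias="source",
                    target_alias="target",
                )
                .when_matched_update_all()
                .when_not_matched_insert_all()
                .execute()
            )
            return
        except DeltaError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying the whole merge
```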
I have also seen this issue pop up when executing 2 merges against the same table sequentially. We have a case where we delete based on a key and then insert some data in subsequent merge statements.
I have concurrent writes and use DynamoDB locking; I think the error message should be clearer when this happens.
It seems that the Rust merge commitInfo implementation is different from the Spark/Databricks implementation. Delta-rs: If we want to solve this problem, we have to change the predicate structure; the easiest way would be to remove the alias, but that would still differ from the Spark implementation. In delta-rs, I believe this predicate is currently only used for the merge operation.
@JonasDev1 what's the issue here? The commitInfo is basically free format and not a requirement of the protocol.
The issue is that the conflict checker doesn't recognise alias names such as source and target in operationParameters.predicate and thinks they are struct fields. Possible solution: if the conflict checker has issues with the aliases, ... But as you mentioned, the commitInfo is free format, so we can keep the ... I will try this out in the next few days.
# Description
This merge request changes the commitInfo.operationParameters.predicate of the merge operation. This is required to allow conflict checking on concurrent commits. Before, the predicate was just the simple merge predicate like `source.eventId = target.eventId`, but the conflict checker doesn't know these aliases and doesn't have access to the source df.

So I now use the early_filter, which was already used to filter a subset of the target table. Basically, the early_filter only contains static filters and partition filters that are converted to fixed values based on the source data. The early_filter can be None if no column/partition pre-filtering is possible, or if the merge contains a not_match_source operation (see the generalize_filter function in the file). The commitInfo predicate uses exactly this filter, except that the target alias is removed.

The predicate is used by the conflict checker, for example when there are multiple concurrent merges. If there is a predicate, the conflict checker will check whether the concurrent commit wrote or deleted data within that predicate. If the predicate is None, the conflict checker will treat the commit as a `read_whole_table` and interpret any concurrent updates as a conflict.

Example: target table with partition column country
- Merge with predicate `source.id = target.id AND target.country='DE'` -> commitInfo predicate `country='DE'`
- Merge with predicate `source.id = target.id AND target.country=source.country` -> commitInfo predicate `country='DE' OR country='US'`
- Merge with predicate `source.id = target.id` -> commitInfo predicate None (as a full target-table join is required)

# Related Issue(s)
- closes #2084
- closes #2227

---------

Co-authored-by: Jonas Schmitz <[email protected]>
Co-authored-by: David Blajda <[email protected]>
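To make the example above concrete, here is a hedged sketch using the Python binding; the table URI, partition column, and data are invented for illustration:

```python
import pyarrow as pa
from deltalake import DeltaTable

# Source batch containing only German rows; the target table is partitioned by `country`.
source = pa.table({"id": [1, 2], "country": ["DE", "DE"], "value": [10, 20]})

dt = DeltaTable("s3://bucket/events")  # hypothetical table URI

(
    dt.merge(
        source=source,
        # Pinning target.country means the commit can record the partition-level
        # predicate country='DE', so a concurrent merge into country='US'
        # no longer registers as a conflict.
        predicate="source.id = target.id AND target.country = 'DE'",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```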
This issue is now fixed, but there is another issue after the commit.
Environment
Delta-rs version: 0.15.1
Binding: Python
Environment:
Bug
What happened:
I try to merge data into an existing deltatable like so:
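Roughly, the call is of the following shape (the table URI and non-key columns here are placeholders, not the real table):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/my-table")  # placeholder URI; the real table is partitioned on two columns
source = pa.table({"_merge_key": ["a", "b"], "someColumn": [1, 2]})  # placeholder schema

(
    dt.merge(
        source=source,
        predicate="target._merge_key = source._merge_key",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```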
This sometimes works, and sometimes fails with the following error:
_internal.DeltaError: Generic DeltaTable error: Schema error: No field named target._merge_key. Valid fields are _merge_key, <other camel-case columns>
What you expected to happen:
I expect the merge to succeed. The error claims that target._merge_key is not a valid field name, but _merge_key is, which is a bit strange.

How to reproduce it:
I have had trouble reproducing this locally and/or reliably.
The merge operations are run as part of a batch job which begins with some table-maintenance calls (sketched below),
followed by a number of merge operations. If the first merge fails, all subsequent merge operations also fail. Sometimes the first merge succeeds, and then all subsequent merges also seem to succeed. So I wonder whether any of these optimize/vacuum/cleanup methods could (sometimes) corrupt the table? (Maybe relevant: are these operations run synchronously or asynchronously?)
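For reference, the maintenance prelude is roughly of this shape; the exact methods and parameters below are assumptions based on the optimize/vacuum/cleanup mention above, not the job's actual code:

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/my-table")  # placeholder URI

dt.optimize.compact()     # compact small files
dt.vacuum(retention_hours=168, enforce_retention_duration=True, dry_run=False)
dt.cleanup_metadata()     # remove expired log files
dt.create_checkpoint()
```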