_internal.DeltaError when merging #2084
Comments
@halvorlu do you have concurrent writers to the same table?
There should not be concurrent writers, no.
@halvorlu Is this table partitioned? Can you add additional logging of the schema for the source data?
I added this logging of field names:
And the field names are identical.
Do you know where this error is thrown? I've tried searching in the code, but haven't found out where.
And yes, the table is partitioned on two columns.
@halvorlu can you try making a minimal reproducible example with sample data?
I have the same problem!
@ion-elgreco I've tried to reproduce this reliably, but haven't really been able to...
I'm using merge.
Context:
Any news about this?
@rtyler do you have any thoughts here? Could it be related to something with the logstore implementation?
I hit this issue this week when doing merges from multiple concurrent writers with the DynamoDB locking provider enabled. From the analysis I did, the error seems to originate in the retry recovery after a TransactionError::VersionAlreadyExists error. That function tries to resolve the table merger predicate against the PyArrow schema from a default DataFusion session. This fails with a schema exception because the merge predicate ("target._merge_key = source._merge_key") is only valid against the joint schema of 'source' and 'target' as built in operations/merge build_join_schema. It seems like merge execute needs to build proper predicates for the retry function based on all the inputs and add them to the DeltaOperation::Merge operation so that they are available in commit_with_retries.

Due to the complexity of that merge code, I ended up just adding user-level retry logic around the whole merge.execute() call and will pay the extra expense of writing the data again when this occurs (even if the writes touch separate partitions and should succeed).

Aside: one of my colleagues currently runs into another error, "_internal.DeltaError: Invalid table version: 202", originating from that same retry block when calling write_deltalake with mode="overwrite" from 5 concurrent writers (in a test). I can't recreate it, but I have significantly more latency to our S3 region. I haven't investigated this one as deeply, but it seems like another edge case to be aware of within the retry logic.
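For anyone wanting the same workaround, here is a minimal sketch of that user-level retry, assuming the deltalake Python binding and the predicate mentioned above; the retry count, backoff, and exception handling are choices of this sketch, not something prescribed by the library:

```python
import time

import pyarrow as pa
from deltalake import DeltaTable
from deltalake.exceptions import DeltaError  # assumes this re-export is available in your version


def merge_with_retry(table_uri: str, source: pa.Table, retries: int = 3) -> None:
    """Retry the whole merge when the commit-retry path raises a DeltaError.

    The table URI, key column, and retry policy here are illustrative only.
    """
    for attempt in range(retries):
        try:
            dt = DeltaTable(table_uri)  # re-load so the merge starts from the latest version
            (
                dt.merge(
                    source=source,
                    predicate="target._merge_key = source._merge_key",
                    source_alias="source",
                    target_alias="target",
                )
                .when_matched_update_all()
                .when_not_matched_insert_all()
                .execute()
            )
            return
        except DeltaError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying the whole merge
```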
I have also seen this issue pop up when executing 2 merges against the same table sequentially. We have a case where we delete based on a key and then insert some data in subsequent merge statements.
I have concurrent writes and use DynamoDB locking; I think the error message should be clearer when this happens.
It seems that the Rust merge commitInfo implementation is different from the Spark/Databricks implementation. Delta-rs: If we want to solve this problem, we have to change the predicate structure; the easiest way would be to remove the alias, but that would still differ from the Spark implementation. In delta-rs, I believe this predicate is currently only used for the merge operation.
@JonasDev1 what's the issue here? The commitInfo is basically free format and not a requirement of the protocol.
The issue is that the conflict checker doesn't recognise alias names such as source and target in operationParameters.predicate and thinks they are struct fields. Possible solution: if the conflict checker has issues with the aliases, ... But as you mentioned, the commitInfo is free format, so we can keep the ... I will try this out in the next few days.
# Description
This merge request changes the commitInfo.operationParameters.predicate of the merge operation. This is required to allow conflict checking on concurrent commits. Before, the predicate was just the simple merge predicate like `source.eventId = target.eventId`, but the conflict checker doesn't know these aliases and doesn't have access to the source df.

So I now use the early_filter, which was already used to filter a subset of the target table. Basically, the early_filter only contains static filters and partition filters that are converted to fixed values based on the source data. The early_filter can be None if no column/partition pre-filtering is possible, or if the merge contains a not_match_source operation (see the generalize_filter function in the file). The commitInfo predicate uses exactly this filter, except that the target alias is removed.

The predicate is used by the conflict checker, for example when there are multiple concurrent merges. If there is a predicate, the conflict checker will check whether the concurrent commit wrote or deleted data within that predicate. If the predicate is None, the conflict checker will treat the commit as a `read_whole_table` and interpret any concurrent updates as a conflict.

Example: target table with partition column country
- Merge with predicate `source.id = target.id AND target.country='DE'` -> commitInfo predicate `country='DE'`
- Merge with predicate `source.id = target.id AND target.country=source.country` -> commitInfo predicate `country='DE' OR country='US'`
- Merge with predicate `source.id = target.id` -> commitInfo predicate None (as a full target-table join is required)

# Related Issue(s)
- closes #2084
- closes #2227

---------

Co-authored-by: Jonas Schmitz <[email protected]>
Co-authored-by: David Blajda <[email protected]>
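To make the example above concrete, here is a hedged sketch using the Python binding; the table URI, partition column, and data are invented for illustration:

```python
import pyarrow as pa
from deltalake import DeltaTable

# Source batch containing only German rows; the target table is partitioned by `country`.
source = pa.table({"id": [1, 2], "country": ["DE", "DE"], "value": [10, 20]})

dt = DeltaTable("s3://bucket/events")  # hypothetical table URI

(
    dt.merge(
        source=source,
        # Pinning target.country means the commit can record the partition-level
        # predicate country='DE', so a concurrent merge into country='US'
        # no longer registers as a conflict.
        predicate="source.id = target.id AND target.country = 'DE'",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```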
This issue is now fixed, but there is another issue after the commit.
Environment
Delta-rs version: 0.15.1
Binding: Python
Environment:
Bug
What happened:
I try to merge data into an existing deltatable like so:
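Roughly, the call is of the following shape (the table URI and non-key columns here are placeholders, not the real table):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/my-table")  # placeholder URI; the real table is partitioned on two columns
source = pa.table({"_merge_key": ["a", "b"], "someColumn": [1, 2]})  # placeholder schema

(
    dt.merge(
        source=source,
        predicate="target._merge_key = source._merge_key",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```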
This sometimes works, and sometimes fails with the following error:
_internal.DeltaError: Generic DeltaTable error: Schema error: No field named target._merge_key. Valid fields are _merge_key, <other camel-case columns>
What you expected to happen:
I expect the merge to succeed. The error claims that target._merge_key is not a valid field name, but _merge_key is, which is a bit strange.

How to reproduce it:
I have had trouble reproducing this locally and/or reliably.
The merge operations are run as part of a batch job which begins with some table-maintenance calls (sketched below),
followed by a number of merge operations. If the first merge fails, all subsequent merge operations also fail. Sometimes the first merge succeeds, and then all subsequent merges also seem to succeed. So I wonder whether any of these optimize/vacuum/cleanup methods could (sometimes) corrupt the table? (Maybe relevant: are these operations run synchronously or asynchronously?)
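For reference, the maintenance prelude is roughly of this shape; the exact methods and parameters below are assumptions based on the optimize/vacuum/cleanup mention above, not the job's actual code:

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/my-table")  # placeholder URI

dt.optimize.compact()     # compact small files
dt.vacuum(retention_hours=168, enforce_retention_duration=True, dry_run=False)
dt.cleanup_metadata()     # remove expired log files
dt.create_checkpoint()
```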