You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently merge in python delta API does two things:
It executes business logic based on merge conditions
It writes changes to s3/fs etc — data access
And it makes impossible to write good (fast & cheap) tests.
Currently an average piece of code looks like this:
defdoSomething(DeltaTabledeltaTable, DataFramenewDedupedLogs) ->void:
deltaTable.alias("logs") \
.merge( \
newDedupedLogs.alias("newDedupedLogs"), \
"logs.uniqueId = newDedupedLogs.uniqueId" \
) \
.whenNotMatchedInsertAll() \
.execute() # note: it does save data right here
To test it we have only one option for now — it is to write integration tests which write data to s3/fs.
It is very slow and expensive.
It would be nice to split these responsibilities and make it possible to merge without saving data.
The code could look like this basically:
And this code could be tested without handling 'side effects' like data access.
Motivation
It is important cuz makes testing cheaper and faster.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
No. I cannot contribute this feature at this time.
The text was updated successfully, but these errors were encountered:
Feature request
Which Delta project/connector is this regarding?
Overview
Hi!
Currently
merge
in python delta API does two things:And it makes impossible to write good (fast & cheap) tests.
Currently an average piece of code looks like this:
To test it we have only one option for now — it is to write integration tests which write data to s3/fs.
It is very slow and expensive.
It would be nice to split these responsibilities and make it possible to merge without saving data.
The code could look like this basically:
And this code could be tested without handling 'side effects' like data access.
Motivation
It is important cuz makes testing cheaper and faster.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
The text was updated successfully, but these errors were encountered: