[Feature Request] Split merge and data access logic #3880

Open
2 of 8 tasks
aurokk opened this issue Nov 14, 2024 · 0 comments
Labels
enhancement New feature or request

aurokk commented Nov 14, 2024

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Hi!

Currently, merge in the Python Delta API does two things:

  1. It executes the business logic defined by the merge conditions
  2. It writes the changes to S3, the file system, etc. (data access)

This makes it impossible to write good (fast and cheap) tests.


A typical piece of code currently looks like this:

from delta.tables import DeltaTable
from pyspark.sql import DataFrame


def doSomething(deltaTable: DeltaTable, newDedupedLogs: DataFrame) -> None:
    (
        deltaTable.alias("logs")
        .merge(
            newDedupedLogs.alias("newDedupedLogs"),
            "logs.uniqueId = newDedupedLogs.uniqueId",
        )
        .whenNotMatchedInsertAll()
        .execute()  # note: this saves the data right here
    )

To test it, the only option for now is to write integration tests that write data to S3 or the file system.
This is very slow and expensive.


It would be nice to split these responsibilities and make it possible to merge without saving data.
The code could look roughly like this:

def doSomething(target: DataFrame, source: DataFrame) -> DataFrame:
    return (
        DeltaTable.merge(  # note: proposed static helper method (does not exist today)
            target.alias("logs"),
            source.alias("newDedupedLogs"),
            "logs.uniqueId = newDedupedLogs.uniqueId",
        )
        .whenNotMatchedInsertAll()
        .execute()  # note: it returns a DataFrame
    )

This code could then be tested without handling side effects such as data access.
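
For illustration, here is a minimal sketch of what a storage-free merge and its test might look like using plain DataFrame operations. This is not the Delta API: the names `merge_insert_only` and `check_merge_insert_only` are hypothetical, the left anti join only approximates `whenNotMatchedInsertAll()`, both frames are assumed to have the same columns, and `spark` is assumed to be a local SparkSession.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pyspark is only needed for the type hints
    from pyspark.sql import DataFrame


def merge_insert_only(target: DataFrame, source: DataFrame, key: str) -> DataFrame:
    """Hypothetical helper: approximate whenNotMatchedInsertAll() in memory.

    Rows of `source` whose `key` has no match in `target` are appended.
    Nothing is written to storage.
    """
    # left_anti keeps only the source rows with no matching key in target
    new_rows = source.join(target, on=key, how="left_anti")
    # unionByName matches columns by name, so both frames need the same columns
    return target.unionByName(new_rows)


def check_merge_insert_only(spark) -> None:
    # `spark` is assumed to be a local SparkSession (e.g. a test fixture)
    target = spark.createDataFrame([(1, "a"), (2, "b")], ["uniqueId", "msg"])
    source = spark.createDataFrame([(2, "b2"), (3, "c")], ["uniqueId", "msg"])

    result = merge_insert_only(target, source, "uniqueId")

    rows = sorted((r["uniqueId"], r["msg"]) for r in result.collect())
    # uniqueId 2 already exists in target, so only uniqueId 3 is inserted
    assert rows == [(1, "a"), (2, "b"), (3, "c")]
```

A check like this only needs a local Spark session and never touches S3 or a Delta table, which is the kind of test the proposed API would allow for the real merge logic.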

Motivation

This matters because it makes testing cheaper and faster.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
aurokk added the enhancement label Nov 14, 2024