Add a Rust-backed engine for write_deltalake #1861

Closed
wjones127 opened this issue Nov 14, 2023 · 4 comments · Fixed by #1891
Labels: binding/python, enhancement

@wjones127
Collaborator

Description

Right now we've built on top of the PyArrow writers. This requires a lot of complex code that is essentially duplicating logic in Rust. The main motivation for writing it was that the Rust implementation wasn't ready, so it was faster to build on top of PyArrow. That might not be true anymore.

We can update the signature of write_deltalake() to take an engine parameter (sort of like how Pandas read_parquet has this parameter), which would let users choose to use the pyarrow engine or the Rust engine for now. Eventually we can switch the default and deprecate the pyarrow implementation.
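A minimal sketch of the engine switch described above, not the actual deltalake implementation: the function name `write_table`, the two backend helpers, and the strings they return are hypothetical stand-ins. It only illustrates how an `engine` argument could select a backend while the default stays on the existing one.

```python
from typing import Callable, Dict, List, Literal

Engine = Literal["pyarrow", "rust"]


def _write_with_pyarrow(path: str, rows: List[dict]) -> str:
    # Hypothetical stand-in for the existing PyArrow-based writer.
    return f"pyarrow wrote {len(rows)} rows to {path}"


def _write_with_rust(path: str, rows: List[dict]) -> str:
    # Hypothetical stand-in for a writer implemented in Rust.
    return f"rust wrote {len(rows)} rows to {path}"


# Registry mapping each engine name to its backend function.
_ENGINES: Dict[str, Callable[[str, List[dict]], str]] = {
    "pyarrow": _write_with_pyarrow,
    "rust": _write_with_rust,
}


def write_table(path: str, rows: List[dict], engine: Engine = "pyarrow") -> str:
    """Dispatch a write to the selected backend (default: the existing one)."""
    try:
        writer = _ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {sorted(_ENGINES)}"
        ) from None
    return writer(path, rows)
```

Keeping the backends behind one registry means switching the default later would be a one-line change to the `engine` default, and removing the old backend would not change the call sites.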

We should be on the lookout for issues that block this. First, we need to make sure the same unit tests pass with the new writer. So we should parametrize all tests by the engine.
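Parametrizing by engine could look roughly like the sketch below. It assumes pytest is the test runner; `write_rows` and `test_roundtrip` are hypothetical placeholders, and a real test would call the actual writer with the `engine` argument and read the table back.

```python
import pytest

# The engines every test should be run against.
ENGINES = ["pyarrow", "rust"]


def write_rows(rows, engine):
    # Hypothetical stand-in for the writer under test. It only validates
    # the engine name and echoes the rows back.
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}")
    return list(rows)


@pytest.mark.parametrize("engine", ENGINES)
def test_roundtrip(engine):
    # pytest runs this once per engine, so both writers must satisfy
    # the same assertion.
    rows = [{"id": 1}, {"id": 2}]
    assert write_rows(rows, engine) == rows
```

With this shape, a gap in one engine shows up as a failing test ID such as `test_roundtrip[rust]` rather than as a separate, easily forgotten test file.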

Second, we should be on the lookout for performance issues. We have a set of benchmarks, which we can add to, starting with:

def test_benchmark_write(benchmark, sample_table, tmp_path):

One known issue (that might have been solved) is #1225.

Use Case

Related Issue(s)

@wjones127 wjones127 added enhancement New feature or request binding/python Issues for the Python package labels Nov 14, 2023
@ion-elgreco ion-elgreco self-assigned this Nov 15, 2023
@roeap
Collaborator

roeap commented Nov 16, 2023

We should use this opportunity to also consolidate our writer implementations. Right now we have one in /writer as well as one in /operations/writer, the one in operations being the newer one. Originally we meant to have two implementations only for a short while, but this has been the case for many months now 😆.

The /writer one was mainly kept around to keep supporting the kafka-delta-ingest use case and to also be able to write JSON data.

@ion-elgreco
Collaborator

ion-elgreco commented Nov 16, 2023

@roeap I am mainly exposing operations/write to Python, not writer

@ion-elgreco ion-elgreco modified the milestone: python v0.14 Nov 22, 2023
ion-elgreco added a commit that referenced this issue Nov 29, 2023
# Description
- Adds rust writer as additional engine in python
- Adds overwrite schema functionality to the rust writer. @roeap feel
free to point out improvements 😄

A couple gaps will exist between current Rust writer and pyarrow writer.
We will have to solve this in a later PR:
- Replacewhere (partition filter / predicate) overwrite
(users however can solve this by doing DeltaTable.delete and then
append)

# Related Issue(s)
- closes #1861

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: David Blajda <[email protected]>
Co-authored-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Matthew Powers <[email protected]>
Co-authored-by: Thomas Frederik Hoeck <[email protected]>
Co-authored-by: Adrian Ehrsam <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Marijn Valk <[email protected]>
@roeap
Collaborator

roeap commented Nov 29, 2023

I'll re-open this to keep track until we have a more full-featured rust writer?

@roeap roeap reopened this Nov 29, 2023
@ion-elgreco
Collaborator

@roeap yeah good one!

ion-elgreco added a commit to ion-elgreco/delta-rs that referenced this issue Dec 1, 2023
@rtyler rtyler closed this as completed Jan 30, 2024