Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Streaming Catalog Writes #3160

Merged
merged 11 commits into from
Nov 6, 2024
Merged

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Oct 31, 2024

Implements streaming Iceberg and Delta writes for swordfish. Most of the write scaffolding has already been implemented in #2992, this PR implements the Iceberg/Delta specific functionalities.

A quick TLDR on swordfish writes:

  • All of the row group sizing, file sizing, partitioning, is now handled in the daft-writer crate.
  • Only the actual writing + flushing is currently handled via Pyarrow Parquet + Csv writers. We intend to build our own native writers in the future.

Notes:

  • Modified the iceberg writes such that:
    1. the plan now stores just the spec id + partition cols (we used to keep the whole partitionspec object in the plan but only use the id, maybe we planned on keeping it around for future work, not sure tho pls lmk)
    2. I made the add_missing_columns stuff an explicit projection. It was a lil cleaner this way instead of having swordfish implement add_missing_columns internally.

@github-actions github-actions bot added the enhancement New feature or request label Oct 31, 2024
Copy link

codspeed-hq bot commented Nov 1, 2024

CodSpeed Performance Report

Merging #3160 will not alter performance

Comparing colin/streaming-catalog-writes-2 (073b2a5) with main (8ed174c)

Summary

✅ 17 untouched benchmarks

Copy link

codecov bot commented Nov 1, 2024

Codecov Report

Attention: Patch coverage is 75.55556% with 110 lines in your changes missing coverage. Please review.

Project coverage is 79.13%. Comparing base (8817a08) to head (073b2a5).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
daft/io/writer.py 0.00% 65 Missing ⚠️
daft/delta_lake/delta_lake_write.py 72.60% 20 Missing ⚠️
daft/iceberg/iceberg_write.py 59.45% 15 Missing ⚠️
src/daft-writers/src/catalog.rs 92.72% 4 Missing ⚠️
src/daft-writers/src/lib.rs 95.52% 3 Missing ⚠️
src/daft-local-execution/src/pipeline.rs 97.22% 1 Missing ⚠️
src/daft-physical-plan/src/translate.rs 50.00% 1 Missing ⚠️
src/daft-writers/src/python.rs 98.64% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3160      +/-   ##
==========================================
+ Coverage   78.66%   79.13%   +0.47%     
==========================================
  Files         634      637       +3     
  Lines       78175    77944     -231     
==========================================
+ Hits        61496    61682     +186     
+ Misses      16679    16262     -417     
Files with missing lines Coverage Δ
daft/execution/execution_step.py 89.43% <100.00%> (+0.39%) ⬆️
daft/execution/physical_plan.py 94.01% <ø> (+0.14%) ⬆️
daft/execution/rust_physical_plan_shim.py 95.60% <ø> (+1.03%) ⬆️
daft/logical/builder.py 89.87% <100.00%> (+0.26%) ⬆️
daft/table/table_io.py 88.75% <100.00%> (+3.00%) ⬆️
src/daft-local-execution/src/sinks/write.rs 100.00% <100.00%> (ø)
src/daft-physical-plan/src/local_plan.rs 96.36% <100.00%> (+0.26%) ⬆️
src/daft-plan/src/builder.rs 82.16% <100.00%> (+0.69%) ⬆️
src/daft-plan/src/sink_info.rs 20.54% <ø> (ø)
src/daft-scheduler/src/scheduler.rs 93.18% <100.00%> (+0.05%) ⬆️
... and 8 more

... and 75 files with indirect coverage changes

@colin-ho colin-ho marked this pull request as ready for review November 1, 2024 01:11
@colin-ho colin-ho requested a review from kevinzwang November 1, 2024 01:11
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I just have a few small things that can be cleaned up

daft/io/writer.py Outdated Show resolved Hide resolved
daft/io/writer.py Outdated Show resolved Hide resolved
@colin-ho colin-ho merged commit 9d0dd60 into main Nov 6, 2024
40 checks passed
@colin-ho colin-ho deleted the colin/streaming-catalog-writes-2 branch November 6, 2024 01:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants