Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(connect): add repartition support #3382

Closed
wants to merge 1 commit into from

Conversation

andrewgazelka
Copy link
Contributor

  • the test is not great but idk how to do it better since rdd does
    not work with spark connect (I think)
  • do we want to support non-shuffle repartitioning?

Copy link
Contributor Author

andrewgazelka commented Nov 21, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch 2 times, most recently from e4dadeb to 9bcb827 Compare November 21, 2024 05:18
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch 2 times, most recently from e1aa22e to c27e74e Compare November 21, 2024 05:20
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 9bcb827 to 886d014 Compare November 21, 2024 16:57
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from c27e74e to cf1ba9d Compare November 21, 2024 16:58
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 886d014 to 14d1d50 Compare November 21, 2024 18:43
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from cf1ba9d to 581eb36 Compare November 21, 2024 18:44
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch 2 times, most recently from 7302bfb to 62ae066 Compare December 4, 2024 02:09
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 581eb36 to 59a7046 Compare December 4, 2024 02:09
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 62ae066 to 6e58760 Compare December 4, 2024 02:35
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 59a7046 to 54df4fd Compare December 4, 2024 02:38
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 6e58760 to 65d47db Compare December 4, 2024 02:45
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 54df4fd to 3d8335e Compare December 4, 2024 02:45
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 65d47db to 670c495 Compare December 4, 2024 08:53
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 3d8335e to 7819f2a Compare December 4, 2024 08:53
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 670c495 to c57e270 Compare December 4, 2024 09:25
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 7819f2a to a97b69b Compare December 4, 2024 09:26
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from c57e270 to 7bdd623 Compare December 4, 2024 23:59
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from a97b69b to 5957c06 Compare December 4, 2024 23:59
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 7bdd623 to 8fcea97 Compare December 5, 2024 00:42
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 5957c06 to 353a6e1 Compare December 5, 2024 00:42
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 8fcea97 to 3038c0f Compare December 5, 2024 18:28
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 353a6e1 to eb6a937 Compare December 5, 2024 18:29
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 3038c0f to 31e8bc8 Compare December 5, 2024 21:40
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from b7848ce to 0de7b03 Compare December 10, 2024 01:51
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 5e7d068 to 003929c Compare December 10, 2024 01:51
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 0de7b03 to 3cb7252 Compare December 10, 2024 06:24
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 003929c to a40e010 Compare December 10, 2024 06:25
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 3cb7252 to 21fc2ec Compare December 10, 2024 23:53
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from a40e010 to 9f2e53a Compare December 10, 2024 23:54
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from 21fc2ec to f2a1799 Compare December 11, 2024 21:04
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 9f2e53a to 538bb93 Compare December 11, 2024 21:04
@andrewgazelka andrewgazelka force-pushed the andrew/connect-intersect-union branch from f2a1799 to 11fefd5 Compare December 11, 2024 21:06
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 538bb93 to a269fea Compare December 11, 2024 21:07
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from a269fea to 456be41 Compare December 11, 2024 21:23
@andrewgazelka andrewgazelka changed the base branch from andrew/connect-intersect-union to main December 11, 2024 21:23
@andrewgazelka andrewgazelka marked this pull request as ready for review December 11, 2024 21:25
@andrewgazelka andrewgazelka changed the title [FEAT] connect: add repartition support feat(connect): add repartition support Dec 11, 2024
@github-actions github-actions bot added the feat label Dec 11, 2024
@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch from 456be41 to 5196ab7 Compare December 11, 2024 22:01
Copy link

codspeed-hq bot commented Dec 11, 2024

CodSpeed Performance Report

Merging #3382 will degrade performances by 32.09%

Comparing andrew/connect-partition (5561e14) with main (07752b8)

Summary

❌ 1 regressions
✅ 26 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main andrew/connect-partition Change
test_show[100 Small Files] 16.1 ms 23.7 ms -32.09%

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 82.75862% with 5 lines in your changes missing coverage. Please review.

Project coverage is 77.86%. Comparing base (07752b8) to head (5561e14).

Files with missing lines Patch % Lines
...onnect/src/translation/logical_plan/repartition.rs 84.61% 4 Missing ⚠️
src/daft-connect/src/translation/logical_plan.rs 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #3382   +/-   ##
=======================================
  Coverage   77.86%   77.86%           
=======================================
  Files         719      720    +1     
  Lines       88459    88488   +29     
=======================================
+ Hits        68877    68902   +25     
- Misses      19582    19586    +4     
Files with missing lines Coverage Δ
src/daft-connect/src/translation/logical_plan.rs 68.85% <66.66%> (-0.12%) ⬇️
...onnect/src/translation/logical_plan/repartition.rs 84.61% <84.61%> (ø)

... and 1 file with indirect coverage changes

Comment on lines +28 to +37
let shuffle = shuffle.unwrap_or(true);

if !shuffle {
bail!("Repartitioning without shuffling is not yet supported");
}

plan.builder = plan
.builder
.random_shuffle(Some(num_partitions))
.wrap_err("Failed to apply random_shuffle to logical plan")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should support non shuffle based repartitions too.

in daft, we do this via hash_repartition

Comment on lines +3 to +13
def test_repartition(spark_session):
# Create a simple DataFrame
df = spark_session.range(10)

# Test repartitioning to 2 partitions
repartitioned = df.repartition(2)

# Verify data is preserved after repartitioning
original_data = sorted(df.collect())
repartitioned_data = sorted(repartitioned.collect())
assert repartitioned_data == original_data, "Data should be preserved after repartitioning"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd instead just look at the plan and assert that it has a partition node.

ex: (note: this is using native spark, so our explain will be a little different)

d = [{'idx': 1}]

df = spark.createDataFrame(d).repartition(2)
df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange RoundRobinPartitioning(2), REPARTITION_BY_NUM, [plan_id=6]
   +- Scan ExistingRDD[idx#0L]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this is fine for now considering we don't have explain yet

@andrewgazelka andrewgazelka force-pushed the andrew/connect-partition branch 6 times, most recently from 7d2d9d6 to 655aec2 Compare December 19, 2024 10:14
- [ ] the test is not great but idk how to do it better since rdd does
  not work with spark connect (I think)
- [ ] do we want to support non-shuffle repartitioning?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants