
[PERF] Remove calls to remote_len_partition #1660

Merged: 9 commits merged from jay/remote-len-refactor into main on Nov 22, 2023

Conversation

@jaychia (Contributor) commented Nov 22, 2023

This PR refactors PartitionSet.len_of_partitions to avoid usage of our remote_len_partition Ray remote function, which has been observed to cause problems when run on dataframes with large amounts of spilling.

A few refactors had to be performed to achieve this (a rough sketch of the resulting shape follows this list):

  1. The RayPartitionSet was refactored to hold RayMaterializedResult objects instead of raw ray.ObjectRef[Table].
    • This allows us to access the .metadata() method, which holds the length of each partition.
    • To access the ray.ObjectRef[Table], we can use the .partition() method, which returns the underlying partition.
  2. As part of (1), PartitionSet.set_partition had to be refactored to take a MaterializedResult as input instead of a plain PartitionT.
  3. On the execution end, we refactored the code mainly around MaterializedPhysicalPlan, which now yields MaterializedResult[PartitionT] instead of just PartitionT when indicating "done" tasks.
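A minimal sketch of the resulting shape, assuming simplified names and signatures (the PartitionMetadata stand-in here carries only num_rows; the real classes record more):

```python
from dataclasses import dataclass
from typing import Dict, Generic, List, TypeVar

PartitionT = TypeVar("PartitionT")


@dataclass
class PartitionMetadata:
    num_rows: int


class MaterializedResult(Generic[PartitionT]):
    """Toy stand-in: pairs a partition with metadata recorded at materialization time."""

    def __init__(self, part: PartitionT, meta: PartitionMetadata) -> None:
        self._part = part
        self._meta = meta

    def partition(self) -> PartitionT:
        return self._part

    def metadata(self) -> PartitionMetadata:
        return self._meta


class RayPartitionSetSketch(Generic[PartitionT]):
    """Holds MaterializedResult objects rather than raw ray.ObjectRef[Table] handles."""

    def __init__(self) -> None:
        self._results: Dict[int, MaterializedResult[PartitionT]] = {}

    def set_partition(self, idx: int, part: MaterializedResult[PartitionT]) -> None:
        self._results[idx] = part

    def len_of_partitions(self) -> List[int]:
        # Lengths come from locally held metadata; no remote_len_partition task is launched.
        return [r.metadata().num_rows for _, r in sorted(self._results.items())]
```

The key point is that each partition's length is read from metadata already attached to the result, so no extra Ray task has to touch the (possibly spilled) partition data.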

@@ -934,7 +934,7 @@ class PhysicalPlanScheduler:
     def num_partitions(self) -> int: ...
     def to_partition_tasks(
         self, psets: dict[str, list[PartitionT]], is_ray_runner: bool
-    ) -> physical_plan.MaterializedPhysicalPlan: ...
+    ) -> physical_plan.InProgressPhysicalPlan: ...
jaychia (Contributor, Author):

I think this was just typed incorrectly from the beginning; not sure why our typechecks didn't catch it.

@@ -251,35 +252,6 @@ def __repr__(self) -> str:
         return super().__str__()


-class MaterializedResult(Protocol[PartitionT]):
jaychia (Contributor, Author):

"Promoted" and moved to partitioning.py to avoid circular deps

T = TypeVar("T")


# A PhysicalPlan that is still being built - may yield both PartitionTaskBuilders and PartitionTasks.
InProgressPhysicalPlan = Iterator[Union[None, PartitionTask[PartitionT], PartitionTaskBuilder[PartitionT]]]

# A PhysicalPlan that is complete and will only yield PartitionTasks or final PartitionTs.
-MaterializedPhysicalPlan = Iterator[Union[None, PartitionTask[PartitionT], PartitionT]]
+MaterializedPhysicalPlan = Iterator[Union[None, PartitionTask[PartitionT], MaterializedResult[PartitionT]]]
jaychia (Contributor, Author):

NOTE this important change: instead of yielding PartitionT (either Table or ray.ObjectRef), we now yield the MaterializedResult container.
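A rough illustration of what that means for code that drains a completed plan (all classes below are toy stand-ins, not daft's real PartitionTask or MaterializedResult types):

```python
from typing import Iterator, List, Union


class PartitionTask:
    """Stand-in for an in-flight task still yielded by the plan."""


class MaterializedResult:
    """Stand-in for a finished partition plus its metadata."""

    def __init__(self, table: List[int]) -> None:
        self._table = table

    def partition(self) -> List[int]:
        return self._table

    def metadata(self) -> dict:
        return {"num_rows": len(self._table)}


PlanItem = Union[None, PartitionTask, MaterializedResult]


def collect_partitions(plan: Iterator[PlanItem]) -> List[List[int]]:
    """Previously terminal items were bare partitions; now each one must be unwrapped."""
    out = []
    for item in plan:
        if item is None or isinstance(item, PartitionTask):
            continue  # nothing finished at this step
        out.append(item.partition())
    return out


# Example: a "plan" that yields one pending task and one finished result.
print(collect_partitions(iter([None, PartitionTask(), MaterializedResult([1, 2, 3])])))
```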

@@ -92,6 +92,36 @@ def from_table(cls, table: Table) -> PartitionMetadata:
PartitionT = TypeVar("PartitionT")


+@runtime_checkable
jaychia (Contributor, Author):

This is incompatible with Python 3.7; we need to find a way to remove it.

samster25 (Member):

I think you can do `from typing_extensions import runtime_checkable` for Python 3.7.
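A common shape for that fallback (a sketch; whether daft gates on the Python version or depends on typing_extensions unconditionally is an assumption):

```python
import sys

if sys.version_info >= (3, 8):
    from typing import Protocol, runtime_checkable
else:
    # typing_extensions backports both names for Python 3.7.
    from typing_extensions import Protocol, runtime_checkable


@runtime_checkable
class HasMetadata(Protocol):
    """Illustrative protocol; runtime_checkable lets isinstance() checks against it work."""

    def metadata(self) -> object: ...
```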

@@ -126,7 +156,7 @@ def get_partition(self, idx: PartID) -> PartitionT:
        raise NotImplementedError()

    @abstractmethod
-    def set_partition(self, idx: PartID, part: PartitionT) -> None:
+    def set_partition(self, idx: PartID, part: MaterializedResult[PartitionT]) -> None:
jaychia (Contributor, Author):

set_partition now takes as input a MaterializedResult instead of a plain old PartitionT.


codecov bot commented Nov 22, 2023

Codecov Report

Merging #1660 (386f912) into main (55a95ab) will decrease coverage by 0.14%.
Report is 5 commits behind head on main.
The diff coverage is 89.18%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1660      +/-   ##
==========================================
- Coverage   85.02%   84.88%   -0.14%     
==========================================
  Files          55       55              
  Lines        5314     5318       +4     
==========================================
- Hits         4518     4514       -4     
- Misses        796      804       +8     
Files Coverage Δ
daft/dataframe/dataframe.py 86.07% <100.00%> (-1.14%) ⬇️
daft/execution/execution_step.py 92.70% <100.00%> (+1.04%) ⬆️
daft/execution/physical_plan.py 93.35% <100.00%> (-0.03%) ⬇️
daft/execution/rust_physical_plan_shim.py 98.59% <100.00%> (-0.02%) ⬇️
daft/runners/pyrunner.py 96.95% <100.00%> (-0.02%) ⬇️
daft/runners/runner.py 80.76% <100.00%> (-0.72%) ⬇️
daft/runners/runner_io.py 86.20% <100.00%> (-0.46%) ⬇️
daft/runners/ray_runner.py 89.59% <86.95%> (-0.16%) ⬇️
daft/runners/partitioning.py 80.45% <70.58%> (-1.22%) ⬇️

... and 1 file with indirect coverage changes

@samster25 (Member) left a comment:

Looks great! Can we also verify somehow that get_meta is not being called when running len on a materialized result in the dataframe?
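One hedged way to check that, sketched with unittest.mock (the patch target below is a guess at where get_meta lives and would need to point at the real module path):

```python
from unittest.mock import patch

import daft


def test_len_skips_get_meta():
    df = daft.from_pydict({"a": [1, 2, 3]}).collect()
    # Assumed patch target: replace with wherever get_meta is actually defined.
    with patch(
        "daft.runners.pyrunner.PyRunner.get_meta",
        side_effect=AssertionError("get_meta should not be called for a materialized df"),
    ):
        assert len(df) == 3
```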

@@ -198,8 +198,9 @@ def iter_partitions(self) -> Iterator[Union[Table, "RayObjectRef"]]:
        else:
            # Execute the dataframe in a streaming fashion.
            context = get_context()
-            partitions_iter = context.runner().run_iter(self._builder)
-            yield from partitions_iter
+            results_iter = context.runner().run_iter(self._builder)
samster25 (Member):

I think you may also have to handle the `if self._result is not None:` case. We can add a test that does:

df = df.collect()
for _ in df.iter_partitions():
  pass
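Fleshed out slightly, that test might look like the following (daft.from_pydict is used here as a convenient constructor; any small dataframe would do):

```python
import daft


def test_iter_partitions_after_collect():
    df = daft.from_pydict({"a": [1, 2, 3]}).collect()
    # Exercises the `if self._result is not None:` branch of iter_partitions.
    for part in df.iter_partitions():
        assert part is not None
```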

jaychia (Contributor, Author):

        if self._result is not None:
            # If the dataframe has already finished executing,
            # use the precomputed results.
            yield from self._result.values()

This block is still safe, because PartitionSet.values() still returns partitions (not MaterializedResult)
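A sketch of why that holds, assuming values() unwraps each stored result before returning (this is an assumption about the implementation's shape, not a quote of it):

```python
class PartitionSetValuesSketch:
    """Toy partition set: stores result wrappers, but hands back raw partitions from values()."""

    def __init__(self, results: dict) -> None:
        self._results = results  # idx -> object with a .partition() accessor

    def values(self) -> list:
        # Unwrap before returning, so iter_partitions can still yield plain partitions.
        return [result.partition() for _, result in sorted(self._results.items())]
```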

daft/execution/physical_plan.py: review thread (outdated, resolved)
daft/runners/ray_runner.py: review thread (outdated, resolved)
@jaychia enabled auto-merge (squash) November 22, 2023 04:39
@jaychia merged commit ff218e7 into main on Nov 22, 2023
39 checks passed
@jaychia deleted the jay/remote-len-refactor branch on November 22, 2023 04:50