-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF] Remove calls to remote_len_partition
#1660
Conversation
@@ -934,7 +934,7 @@ class PhysicalPlanScheduler: | |||
def num_partitions(self) -> int: ... | |||
def to_partition_tasks( | |||
self, psets: dict[str, list[PartitionT]], is_ray_runner: bool | |||
) -> physical_plan.MaterializedPhysicalPlan: ... | |||
) -> physical_plan.InProgressPhysicalPlan: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was just typed wrongly from the beginning, not sure why our typechecks didn't catch it.
@@ -251,35 +252,6 @@ def __repr__(self) -> str: | |||
return super().__str__() | |||
|
|||
|
|||
class MaterializedResult(Protocol[PartitionT]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Promoted" and moved to partitioning.py
to avoid circular deps
T = TypeVar("T") | ||
|
||
|
||
# A PhysicalPlan that is still being built - may yield both PartitionTaskBuilders and PartitionTasks. | ||
InProgressPhysicalPlan = Iterator[Union[None, PartitionTask[PartitionT], PartitionTaskBuilder[PartitionT]]] | ||
|
||
# A PhysicalPlan that is complete and will only yield PartitionTasks or final PartitionTs. | ||
MaterializedPhysicalPlan = Iterator[Union[None, PartitionTask[PartitionT], PartitionT]] | ||
MaterializedPhysicalPlan = Iterator[Union[None, PartitionTask[PartitionT], MaterializedResult[PartitionT]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE this important change: instead of yielding PartitionT
(either Table
or ray.ObjectRef
), we now yield the MaterializedResult
container instead.
daft/runners/partitioning.py
Outdated
@@ -92,6 +92,36 @@ def from_table(cls, table: Table) -> PartitionMetadata: | |||
PartitionT = TypeVar("PartitionT") | |||
|
|||
|
|||
@runtime_checkable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incompatible with Python 3.7, need to find a way to remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do from typing_extensions import runtime_checkable
for python 3.7
@@ -126,7 +156,7 @@ def get_partition(self, idx: PartID) -> PartitionT: | |||
raise NotImplementedError() | |||
|
|||
@abstractmethod | |||
def set_partition(self, idx: PartID, part: PartitionT) -> None: | |||
def set_partition(self, idx: PartID, part: MaterializedResult[PartitionT]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_partition
now takes as input a MaterializedResult
instead of a plain old PartitionT
.
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1660 +/- ##
==========================================
- Coverage 85.02% 84.88% -0.14%
==========================================
Files 55 55
Lines 5314 5318 +4
==========================================
- Hits 4518 4514 -4
- Misses 796 804 +8
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Can we also verify somehow that get_meta is not being call when running len
for a materialized result in the dataframe
@@ -198,8 +198,9 @@ def iter_partitions(self) -> Iterator[Union[Table, "RayObjectRef"]]: | |||
else: | |||
# Execute the dataframe in a streaming fashion. | |||
context = get_context() | |||
partitions_iter = context.runner().run_iter(self._builder) | |||
yield from partitions_iter | |||
results_iter = context.runner().run_iter(self._builder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you also may have to handle the if self._result is not None:
case. we can add a test that does
df = df.collect()
for _ in df.iter_partitions():
pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR refactors
PartitionSet.len_of_partitions
to avoid usage of ourremote_len_partition
Ray remote function, which has been observed to cause problems when run on dataframes with large amounts of spilling.A few refactors had to be performed to achieve this:
RayPartitionSet
was refactored to holdRayMaterializedResult
objects instead of just rawray.ObjectRef[Table]
..metadata()
method which holds the length of each partition.ray.ObjectRef[Table]
, we can use the.partition()
method which holds the partitionPartitionSet.set_partition
had to be refactored to take as input aMaterializedResult
instead of a plainPartitionT
MaterializedPhysicalPlan
, which now yieldsMaterializedResult[PartitionT]
instead of justPartitionT
, when indicating "done" tasks.