coalesce with map_partitions #227

dougbrn · 2023-09-13T17:26:10Z

A few weeks ago, the smoke tests failed on the current coalesce function. The initial fix I applied involved resetting the index, which is generally an expensive operation in Dask. This PR should be a better way to handle things: we just use map_partitions to apply coalesce on a partition-by-partition basis, so no index reset is needed.

Science Driver Impact:
The initial fix to the smoke tests caused the coalesce function to fail on the TAPE single-pixel dataset for the time-domain MVP. In particular, Dask complained that it needed to know the divisions when resetting the index. This new implementation works successfully with the TAPE single-pixel dataset.

codecov · 2023-09-13T18:27:29Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% 🎉

Comparison is base (d0235c3) 92.55% compared to head (710dde3) 92.57%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
+ Coverage   92.55%   92.57%   +0.01%     
==========================================
  Files          22       22              
  Lines        1129     1132       +3     
==========================================
+ Hits         1045     1048       +3     
  Misses         84       84

Files Changed	Coverage Δ
src/tape/ensemble.py	`89.46% <100.00%> (+0.06%)`	⬆️


wilsonbb

Looks good to me!

wilsonbb · 2023-09-13T19:06:07Z

src/tape/ensemble.py

+            input_dfs = []
+            for col in input_cols:
+                col_df = df[[col]]
+


Nit: Probably can just remove this empty line

wilsonbb · 2023-09-13T19:06:14Z

src/tape/ensemble.py

+            coal_df = input_dfs[0]
+            while i < len(input_dfs) - 1:
+                coal_df = coal_df.combine_first(input_dfs[i + 1])
+                i += 1


Nit: not relevant for this specific PR, but we could alternatively

    # Combine each dataframe
    coal_df = input_dfs.pop()
    while input_dfs:
        coal_df = coal_df.combine_first(input_dfs.pop())

Using pop(0) if we care about preserving the current order

This seems a bit more readable to me but up to you

Nice, this looks better to me as well, implemented!
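The ordering remark above matters because `combine_first` keeps the caller's non-null values and only fills its nulls from the argument, so whichever frame is popped first has the highest priority. A small sketch, using made-up frames and a hypothetical `coalesce_frames` helper, shows how `pop()` and `pop(0)` can give different results:

```python
import numpy as np
import pandas as pd


def coalesce_frames(frames, preserve_order=True):
    """Fold a list of frames together with combine_first.

    Hypothetical helper: preserve_order=True mirrors pop(0),
    preserve_order=False mirrors a bare pop().
    """
    frames = list(frames)  # work on a copy so the caller's list is untouched
    idx = 0 if preserve_order else -1
    coal_df = frames.pop(idx)
    while frames:
        coal_df = coal_df.combine_first(frames.pop(idx))
    return coal_df


first = pd.DataFrame({"flux": [1.0, np.nan, np.nan]})
second = pd.DataFrame({"flux": [10.0, 20.0, np.nan]})
third = pd.DataFrame({"flux": [100.0, 200.0, 300.0]})

in_order = coalesce_frames([first, second, third], preserve_order=True)
from_end = coalesce_frames([first, second, third], preserve_order=False)

print(in_order["flux"].tolist())   # first frame wins wherever it is non-null
print(from_end["flux"].tolist())   # last frame wins wherever it is non-null
```

If the input columns never disagree on non-null values, both orders give the same answer; the choice only matters when priorities between columns are meaningful.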

dougbrn added 2 commits September 13, 2023 10:15

coalesce with map_partitions

009a0b9

use dataframes instead of series

c1ed8b7

add descriptive comments

a4c11b6

dougbrn requested a review from wilsonbb September 13, 2023 18:48

wilsonbb approved these changes Sep 13, 2023


implement suggestions

710dde3

dougbrn merged commit 06ae6db into main Sep 13, 2023
9 checks passed

dougbrn deleted the map_coalesce branch December 11, 2023 19:27
