coalesce with map_partitions #227

Merged
merged 4 commits into from
Sep 13, 2023

Conversation

dougbrn
Collaborator

@dougbrn dougbrn commented Sep 13, 2023

A few weeks ago, the smoke tests failed on the current coalesce function. The initial fix I applied involved resetting the index, which is generally an expensive operation in Dask. This PR should be a better way to handle things: we apply coalesce through map_partitions, on a partition-by-partition basis.

Science Driver Impact:
The initial smoke-test fix caused the coalescing function to fail on the TAPE single-pixel dataset for the time-domain MVP. In particular, it complained about needing to know the divisions when resetting the index. This new implementation works successfully with the TAPE single-pixel dataset.

@codecov

codecov bot commented Sep 13, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% 🎉

Comparison is base (d0235c3) 92.55% compared to head (710dde3) 92.57%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
+ Coverage   92.55%   92.57%   +0.01%     
==========================================
  Files          22       22              
  Lines        1129     1132       +3     
==========================================
+ Hits         1045     1048       +3     
  Misses         84       84              
Files Changed Coverage Δ
src/tape/ensemble.py 89.46% <100.00%> (+0.06%) ⬆️


@dougbrn dougbrn requested a review from wilsonbb September 13, 2023 18:48
Collaborator

@wilsonbb wilsonbb left a comment

Looks good to me!

input_dfs = []
for col in input_cols:
    col_df = df[[col]]

Collaborator

Nit: Probably can just remove this empty line

Collaborator Author

👍

coal_df = input_dfs[0]
while i < len(input_dfs) - 1:
    coal_df = coal_df.combine_first(input_dfs[i + 1])
    i += 1

Collaborator

Nit: not relevant for this specific PR, but we could alternatively write:

        # Combine each dataframe
        coal_df = input_dfs.pop()
        while input_dfs:
            coal_df = coal_df.combine_first(input_dfs.pop())

Use pop(0) instead if we care about preserving the current order, since pop() takes frames from the end of the list.

This seems a bit more readable to me but up to you
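
To make the ordering point concrete, here is a small sketch comparing the two variants. It assumes the frames share a single column name (here `flux`), since `combine_first` aligns on labels; the frame names, the `coalesce` helper, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three single-column frames with the same column label (an assumption).
first = pd.DataFrame({"flux": [1.0, np.nan, np.nan]})
second = pd.DataFrame({"flux": [10.0, 2.0, np.nan]})
third = pd.DataFrame({"flux": [100.0, 200.0, 3.0]})


def coalesce(frames, from_front):
    """Fold frames with combine_first, popping from the front or the back."""
    frames = list(frames)  # copy so the caller's list is not emptied
    coal_df = frames.pop(0) if from_front else frames.pop()
    while frames:
        nxt = frames.pop(0) if from_front else frames.pop()
        # combine_first keeps the caller's values and fills its nulls from nxt.
        coal_df = coal_df.combine_first(nxt)
    return coal_df


front = coalesce([first, second, third], from_front=True)
back = coalesce([first, second, third], from_front=False)

print(front["flux"].tolist())  # [1.0, 2.0, 3.0]
print(back["flux"].tolist())   # [100.0, 200.0, 3.0]
```

With `pop(0)` the first frame keeps priority, matching the indexed loop; with a bare `pop()` the last frame wins wherever it has non-null values.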

Collaborator Author

Nice, this looks better to me as well, implemented!

@dougbrn dougbrn merged commit 06ae6db into main Sep 13, 2023
9 checks passed
@dougbrn dougbrn deleted the map_coalesce branch December 11, 2023 19:27