-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coalesce with map_partitions #227
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #227 +/- ##
==========================================
+ Coverage 92.55% 92.57% +0.01%
==========================================
Files 22 22
Lines 1129 1132 +3
==========================================
+ Hits 1045 1048 +3
Misses 84 84
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
src/tape/ensemble.py
Outdated
input_dfs = [] | ||
for col in input_cols: | ||
col_df = df[[col]] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Probably can just remove this empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/tape/ensemble.py
Outdated
coal_df = input_dfs[0] | ||
while i < len(input_dfs) - 1: | ||
coal_df = coal_df.combine_first(input_dfs[i + 1]) | ||
i += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not relevant for this specific PR, but we could alternatively
# Combine each dataframe
coal_df = input_dfs.pop()
while input_dfs:
coal_df = coal_df.combine_first(input_dfs.pop())
Using pop(0)
if we care about preserving the current order
This seems a bit more readable to me but up to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this looks better to me as well, implemented!
A few weeks ago, the smoke tests failed on the current coalesce function. The initial fix I applied involved resetting the index, which is generally an expensive operation in Dask. This should be a better way to handle things, where we just apply map_partitions to coalesce on a partition-by-partition basis.
Science Driver Impact:
The initial fix to the smoke tests made the coalescing function have issues with the TAPE single pixel dataset for the time-domain MVP, particularly it complained about needing to know the divisions when resetting the index. This new implementation works successfully with the TAPE single-pixel dataset.