Enable dask-expr #383
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #383      +/-   ##
==========================================
+ Coverage   95.55%   95.77%   +0.21%
==========================================
  Files          25       25
  Lines        1710     1751      +41
==========================================
+ Hits         1634     1677      +43
+ Misses         76       74       -2

View full report in Codecov by Sentry.
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
Some minor questions and comments but overall this looks really good!
self.update_frame(
    self.source._propagate_metadata(result)
)  # propagate source metadata and update frame
Is there some method we can override in the EnsembleFrame, or some method to register with the dispatcher, that can do this for us whenever a user calls concat? If it's non-obvious, we could file a small issue to look into it.
I'm not sure, maybe? I opted to just do this since it was the only place I've seen concat used, so I wasn't keen on wrapping it just for this. If you know of a better way I'm happy to change it, or an issue is also good. This is one of a few cases where I wondered whether we could have a better way to wrap the API when the only needed change is this metadata propagation...
"if we could have a better way to wrap the API where the only needed change is this metadata propagation..."
Yeah, an issue for this sounds good to me. I'm also a bit frustrated by this.
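Not the PR's implementation, just a rough sketch of what such a wrapper could look like: the concat_with_metadata name and signature are hypothetical, and it assumes _propagate_metadata (the method used in the snippet above) copies the TAPE-specific attributes onto a new frame.

import dask.dataframe as dd

def concat_with_metadata(frames, source_frame, **kwargs):
    """Concatenate frames, then restore TAPE metadata from source_frame (hypothetical helper)."""
    result = dd.concat(frames, **kwargs)              # plain dask concat drops TAPE-specific attributes
    return source_frame._propagate_metadata(result)   # re-apply them from the source frame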
src/tape/ensemble.py
Outdated
result = self.source._propagate_metadata(
    result.reset_index().set_index(self._id_col).drop(columns=[tmp_time_col])
)
Do we know where in this chain we lose the metadata (groupby, reset_index, set_index, drop, etc.)?
I think it's the drop call, but it could also possibly be the reset_index call
Ah, I took a look again and reminded myself. It's actually before any of these calls: right above this there's a groupby->aggregate. The meta is lost when doing that, I think technically at the groupby stage, since it returns a Dask groupby object.
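A self-contained toy illustration of that point, with placeholder column names and data (only the groupby -> aggregate -> _propagate_metadata pattern comes from the diff above):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"id": [0, 0, 1, 1], "tmp_time": [1, 2, 1, 2], "flux": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

grouped = ddf.groupby(["id", "tmp_time"])     # plain dask GroupBy object: any subclass/frame metadata is gone here
result = grouped.aggregate({"flux": "mean"})  # plain dask DataFrame, not an EnsembleFrame
result = result.reset_index().set_index("id").drop(columns=["tmp_time"])
# ...which is why the diff re-applies self.source._propagate_metadata(result) afterwards.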
Thanks for the changes, looks good to me!
Change Description
Resolves #382.
Solution Description
This PR enables the use of the recently introduced dask-expr backend for Dask with TAPE. For the most part things just needed small tweaks, but a couple of things stood out:
set_index … but we should watch out for any cases where divisions are not being set. It seemed more finicky for single-partition tables as well, so I updated a few unit tests to produce tables with >1 division; a minimal sketch of such a check follows below.
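For reference, a minimal sketch of that division check under assumed column names and toy data (not code from the PR or from the TAPE test suite):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"id": [0, 0, 1, 1, 2, 2], "flux": range(6)})
ddf = dd.from_pandas(pdf, npartitions=2)   # >1 partition, as in the updated unit tests
ddf = ddf.set_index("id")
print(ddf.known_divisions, ddf.divisions)  # watch for cases where divisions end up unset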
Code Quality
Project-Specific Pull Request Checklists
Bug Fix Checklist
New Feature Checklist
Documentation Change Checklist
Build/CI Change Checklist
Other Change Checklist