Enable dask-expr #383

Merged 20 commits on Mar 27, 2024

Collaborator

@dougbrn dougbrn commented Mar 11, 2024

Change Description

Resolves #382.

  • My PR includes a link to the issue that I am addressing

Solution Description

This PR enables TAPE to use dask's recently introduced dask-expr backend. Most things needed only small tweaks, but a couple of things stood out:

  • _meta and divisions properties are now protected, so I had to update cases where we were setting them directly. For divisions, I opted to let Dask set its divisions without our intervention: in some cases we were trying to use the source divisions for batch results, but the source divisions should be what batch finds regardless (as it's a groupby on the source table).
  • divisions were a bit finicky to set. I believe I tracked this down to a kwarg change in set_index, but we should watch out for any cases where divisions are not being set. It also seemed more finicky for single-partition tables, so I updated a few unit tests to produce tables with >1 division.
  • This PR switches TAPE to support only the dask-expr backend. I am not sure how much would break if the user tried the old backend, but I've included some checks in initialization to force the use of dask-expr. We should release a minor version of TAPE (v0.4.0) once this is included to clearly mark the break point between the two backends.
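
The initialization check mentioned in the last bullet could look something like the sketch below. This is not TAPE's actual code: `ensure_dask_expr` and `read_query_planning_setting` are hypothetical names, and the sketch assumes dask exposes the backend choice through the `dataframe.query-planning` config key, with `False` selecting the legacy backend.

```python
def read_query_planning_setting():
    """Return dask's query-planning setting (assumed config key), or None if unset."""
    import dask  # imported lazily so the check itself has no hard dependency

    return dask.config.get("dataframe.query-planning", None)


def ensure_dask_expr(setting):
    """Raise if the setting explicitly selects the legacy dask backend.

    Assumption: ``False`` means legacy dataframes, while ``True`` or ``None``
    (unset) means dask-expr is used when it is installed.
    """
    if setting is False:
        raise RuntimeError(
            "TAPE requires the dask-expr backend; set "
            "'dataframe.query-planning' to True before importing dask.dataframe."
        )
    return True
```

Calling `ensure_dask_expr(read_query_planning_setting())` at package import would fail fast with a readable message rather than letting the old backend break somewhere less obvious.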

Code Quality

  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

  • I have added a function that requires a sync_tables command, and have added the necessary sync_tables call

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Build/CI Change Checklist

  • If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
  • If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

  • Any new or updated docstrings use the NumPy docstring format.
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover any changes
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

github-actions bot commented Mar 11, 2024

Before [dae414b]   After [caea38f]   Ratio   Benchmark (Parameter)
44.6±0.5ms         32.6±0.2ms        0.73    benchmarks.time_batch
47.3±0.1ms         33.5±0.1ms        0.71    benchmarks.time_prune_sync_workflow

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 97.11538%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 95.77%. Comparing base (dae414b) to head (4ee4856).

Files                        Patch %   Lines
src/tape/ensemble_frame.py   95.94%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #383      +/-   ##
==========================================
+ Coverage   95.55%   95.77%   +0.21%     
==========================================
  Files          25       25              
  Lines        1710     1751      +41     
==========================================
+ Hits         1634     1677      +43     
+ Misses         76       74       -2     

@dougbrn dougbrn changed the title WIP: Enable dask-expr Enable dask-expr Mar 26, 2024
@dougbrn dougbrn marked this pull request as ready for review March 26, 2024 20:32
@dougbrn dougbrn requested a review from wilsonbb March 26, 2024 20:37
Collaborator

@wilsonbb wilsonbb left a comment

Some minor questions and comments but overall this looks really good!

Comment on lines +342 to +344
self.update_frame(
    self.source._propagate_metadata(result)
)  # propagate source metadata and update frame
Collaborator

Is there some method we can override in the EnsembleFrame or some method to register to the dispatcher that can do this for us whenever a user calls concat? If it's non-obvious, we could file a small issue to look into it

Collaborator Author

I'm not sure, maybe? I opted to just do this because it was the only time I've seen concat used, so I wasn't keen on wrapping it just for this. If you know of a better way, I'm happy to change it; an issue is also good. This is one of a few cases where I was wondering if we could have a better way to wrap the API where the only needed change is this metadata propagation...

Collaborator

"if we could have a better way to wrap the API where the only needed change is this metadata propagation..."

Yeah, an issue for this sounds good to me. I'm also a bit frustrated by this.
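
One way the wrapping idea discussed in this thread might look is a decorator that re-applies metadata to whatever a wrapped call returns. The sketch below uses a toy `Frame` class rather than TAPE's real `EnsembleFrame`; `propagates_metadata`, the toy `concat`, and the `label`/`ensemble` attributes are illustrative assumptions, and only the `_propagate_metadata` name comes from the code under review.

```python
import functools


class Frame:
    """Toy stand-in for an EnsembleFrame: rows plus metadata (hypothetical)."""

    def __init__(self, rows, label=None, ensemble=None):
        self.rows = list(rows)
        self.label = label
        self.ensemble = ensemble

    def _propagate_metadata(self, other):
        """Copy this frame's metadata onto ``other`` and return it."""
        other.label = self.label
        other.ensemble = self.ensemble
        return other


def propagates_metadata(func):
    """Wrap ``func`` so its result inherits metadata from the first argument."""

    @functools.wraps(func)
    def wrapper(frame, *args, **kwargs):
        result = func(frame, *args, **kwargs)
        return frame._propagate_metadata(result)

    return wrapper


@propagates_metadata
def concat(frame, other):
    """Toy concat that returns a new frame with no metadata attached."""
    return Frame(frame.rows + other.rows)


source = Frame([1, 2], label="source", ensemble="ens")
combined = concat(source, Frame([3]))  # metadata is restored by the decorator
```

With a wrapper like this, call sites would not need to remember the explicit `_propagate_metadata` call; whether it can be hooked into dask's own dispatch is the open question for the follow-up issue.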

Comment on lines 1043 to 1045
result = self.source._propagate_metadata(
    result.reset_index().set_index(self._id_col).drop(columns=[tmp_time_col])
)
Collaborator

Do we know where in this chain we lose the metadata? (groupby, reset_index, set_index, drop, etc)

Collaborator Author

I think it's the drop call, but it could also possibly be the reset_index call

Collaborator Author

Ah, I took another look and reminded myself. It's actually before any of these calls: right above this there's a groupby->aggregate. The meta is lost when doing this, I think technically at the groupby stage, as it returns a dask groupby object.
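
A question like "where in this chain do we lose the metadata?" can be answered mechanically by applying the steps one at a time and checking the result after each. The sketch below does this with toy classes, not dask objects; `MetaFrame`, `PlainGroupBy`, `first_metadata_loss`, and the `label` attribute are all hypothetical names used only to illustrate the approach.

```python
class MetaFrame:
    """Toy frame that carries metadata (hypothetical stand-in)."""

    def __init__(self, rows, label="source"):
        self.rows = rows
        self.label = label

    def drop_first(self):
        # Metadata-preserving operation: returns the same class.
        return MetaFrame(self.rows[1:], self.label)

    def groupby(self):
        # Metadata-losing operation: returns a different kind of object.
        return PlainGroupBy(self.rows)


class PlainGroupBy:
    """Toy groupby object that knows nothing about the frame's metadata."""

    def __init__(self, rows):
        self.rows = rows

    def aggregate(self):
        return [sum(self.rows)]


def first_metadata_loss(obj, steps):
    """Apply named steps in order and return the name of the first step whose
    result no longer has a ``label`` attribute, or None if none loses it."""
    for name, step in steps:
        obj = step(obj)
        if not hasattr(obj, "label"):
            return name
    return None


steps = [
    ("drop_first", lambda f: f.drop_first()),
    ("groupby", lambda f: f.groupby()),
    ("aggregate", lambda g: g.aggregate()),
]
lost_at = first_metadata_loss(MetaFrame([1, 2, 3]), steps)
```

In this toy chain the loss is reported at the groupby step, mirroring the diagnosis above; against real dask objects the check would need to test for whatever attribute or type TAPE actually relies on.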

@dougbrn dougbrn requested a review from wilsonbb March 27, 2024 20:20
Collaborator

@wilsonbb wilsonbb left a comment

Thanks for the changes, looks good to me!

@dougbrn dougbrn merged commit 750fe4b into main Mar 27, 2024
10 checks passed
@dougbrn dougbrn deleted the use_dask_expr branch April 4, 2024 17:16
Development

Successfully merging this pull request may close these issues.

Investigate Switch to Dask Expressions
2 participants