Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Series.__array_ufunc__ #10217

Merged
merged 17 commits into from
Feb 8, 2022

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Feb 4, 2022

This PR is a first step in addressing #9083. It rewrites Series ufunc dispatch to use a simpler dispatch pattern that removes lookup in the overall cudf namespace, restricting only to Series methods or cupy functions. The changes in this PR also enable proper support for indexing so that ufuncs between two Series objects will work correctly and preserve indexes. Additionally, ufuncs will now support Series that contain nulls, which was not previously the case. Series methods that don't exist in pandas.Series that were previously only necessary to support ufunc dispatch have now been removed. Once this PR is merged, I will work on generalizing this approach to also support DataFrame objects, which should enable full support for ufuncs as well as allowing us to deprecate significant chunks of existing code.

For the cases that previously worked, this PR does have some performance implications. Any operator that dispatches to cupy is about 3x faster now (assuming no nulls, since the nullable case was not previously supported), with the exception of logical operators for which we previously defined functions in the Series namespace that do not have pandas analogs. I've made a note in the code that we could reintroduce internal versions of these just for ufunc dispatch if that slowdown becomes a bottleneck for users, but for now I would prefer to avoid any more special cases than we really need.

@vyasr vyasr added feature request New feature or request 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change labels Feb 4, 2022
@vyasr vyasr added this to the CuDF Python Refactoring milestone Feb 4, 2022
@vyasr vyasr self-assigned this Feb 4, 2022
@vyasr vyasr requested a review from a team as a code owner February 4, 2022 00:57
@vyasr vyasr force-pushed the refactor/series_array_ufunc branch from e9b8f0f to 089caf3 Compare February 7, 2022 19:47
@codecov
Copy link

codecov bot commented Feb 7, 2022

Codecov Report

Merging #10217 (089caf3) into branch-22.04 (a7d88cd) will increase coverage by 0.02%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10217      +/-   ##
================================================
+ Coverage         10.42%   10.44%   +0.02%     
================================================
  Files               119      122       +3     
  Lines             20603    20550      -53     
================================================
- Hits               2148     2147       -1     
+ Misses            18455    18403      -52     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
... and 46 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7987791...089caf3. Read the comment docs.

@vyasr
Copy link
Contributor Author

vyasr commented Feb 8, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6e267bd into rapidsai:branch-22.04 Feb 8, 2022
rapids-bot bot pushed a commit that referenced this pull request Feb 15, 2022
This PR addresses the primary issue in #9083, enabling all numpy ufuncs for DataFrame objects. It builds on the work in #10217, generalizing that code path to support multiple columns and moving the method up to `IndexedFrame` to share the logic with `DataFrame`. The custom preprocessing of inputs before handing off to cupy that was implemented in #10217 has been replaced by reusing parts of the existing binop machinery for greater generality, which is especially important for DataFrame binops since they support a wider range of alternative operand types. The current internal refactor is intentionally minimal to leave the focus on the new ufunc features. I will make a follow-up to clean up the internal functions by adding a proper set of hooks into the binop and ufunc implementations so that we can share these implementations with Index types as well, at which point we will be able to remove the extraneous APIs discussed in #9083 (comment).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #10287
rapids-bot bot pushed a commit that referenced this pull request Feb 25, 2022
This PR builds on #10217 and #10287 to bring full ufunc support for Index types, expanding well beyond the small set previously supported in the `cudf.core.ops` namespace. By using most of the machinery introduced for IndexedFrame in the prior two PRs we avoid duplicating much logic so that all ufunc dispatches flow through a relatively standard path of known methods prior to a common cupy dispatch. With this change we are also able to deprecate the various ufunc operations defined in cudf/core/ops.py that exist only for this purpose as well as a number of Frame methods that are not defined for the corresponding pandas types. Users of those APIs are recommended to calling the corresponding numpy/cupy ufuncs instead to leverage the new dispatch.

This PR also fixes a bug where index binary operations that output booleans would previously return instances of GenericIndex, whereas those pandas operations would return numpy arrays. cudf now returns cupy arrays in those cases.

Resolves #9083. Contributes to #9038.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10346
@vyasr vyasr mentioned this pull request Feb 25, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 2, 2022
This PR cleans up the implementation of `__array_function__` for `Series` and `DataFrame` to bring them further into alignment. It also inlines a number of functions defined in `utils/utils.py` that were previously used only in `Series.__array_ufunc__`, building on the improvements in #10217, #10287, and #10346 to clear out methods related to the old `__array_ufunc__` dispatch that are now only used by this `__array_function__` implementation. Inlining these methods also allows significant simplification since they were handling cases that are no longer relevant or possible. Unlike those previous PRs, this one does not actually enable any new features. Although it should marginally accelerate array functions by simplifying the dispatch logic, the fact that this API makes few promises about the nature of the function being applied and our desire to have it "just work" as much as possible means that we must simply adopt an EAFP approach and return `NotImplemented` if any part of the process fails.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #10364
@vyasr vyasr deleted the refactor/series_array_ufunc branch March 9, 2022 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants