Rewrite binary operations for improved performance and additional type support (#8598)

This PR builds on #8166 to introduce a common code path for binary operations between all Frame types, not just Index and Series. The central component is a new internal API for operations between a sequence of pairs of columns (or column-like objects), which subclasses of Frame can call after performing any necessary preprocessing on the operands. Besides reducing total code and addressing some tech debt, this yields substantial performance improvements for DataFrame binops, new features in the form of binary operations between a much wider class of data types, and some minor bug fixes that were previously quite challenging to address.

Resolves #4536.

***Performance***
By avoiding calling through to Series binary operations to perform binops between DataFrames (which requires lots of ColumnAccessor operations, construction of Series objects, and lots of indexing), we gain substantial improvements in DataFrame binops. I see at least a 20% improvement in binops between DataFrames, with significantly larger improvements (up to 60%) depending on data size, nullability, and other related factors. The performance impact of the new code paths for Series and Index binops is negligible, since the changes effectively amount to a loop over a single-element dictionary instead of operating on that element directly.
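The shape of that shared code path can be illustrated with a minimal pure-Python sketch. It is not the cudf implementation: `colwise_binop`, the `Column` alias, and the use of plain lists with `None` as a null marker are all hypothetical stand-ins chosen so the example runs without a GPU.

```python
import operator
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a device column; None marks a null entry.
Column = List[Optional[float]]


def colwise_binop(
    operands: Dict[str, Tuple[Column, Column]],
    op: Callable[[float, float], float],
) -> Dict[str, Column]:
    """Apply `op` elementwise to each named (left, right) column pair."""
    result = {}
    for name, (left, right) in operands.items():
        if len(left) != len(right):
            raise ValueError(f"length mismatch in column {name!r}")
        # Propagate nulls: a null on either side yields a null result.
        result[name] = [
            None if a is None or b is None else op(a, b)
            for a, b in zip(left, right)
        ]
    return result


# DataFrame-like case: several column pairs handled in one call.
frame_result = colwise_binop(
    {"a": ([1, 2, None], [10, 20, 30]), "b": ([1.5, 2.5, 3.5], [1, 1, 1])},
    operator.add,
)

# Series-like case: the same code path with a single-element dictionary.
series_result = colwise_binop({"x": ([1, 2, 3], [4, 5, 6])}, operator.mul)
```

In this sketch the Series-like call pays only for one extra dictionary iteration, which is the kind of overhead the paragraph above describes as negligible.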

***New features***
The centralized binop code made it easy to insert an `as_column` at an appropriate point for the right operand to a binop, which enables support for a wide range of other objects. For instance, it is now possible to add numpy arrays or pandas Series objects to a cudf Series, which was not previously possible. This code also appropriately handles the return of `NotImplemented` for all data types, enabling reflected operations for other libraries that build on `cudf`.
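The role of `NotImplemented` here is standard Python operator dispatch, which the following sketch shows with plain classes. `MiniSeries` and `Tagged` are hypothetical names invented for illustration; they are not cudf types, and the real operand handling in cudf is more involved.

```python
class MiniSeries:
    """Hypothetical stand-in for a Series-like type."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        if isinstance(other, MiniSeries):
            return MiniSeries(a + b for a, b in zip(self.data, other.data))
        if isinstance(other, (int, float)):
            return MiniSeries(a + other for a in self.data)
        # Unknown operand type: defer so Python can try other.__radd__.
        return NotImplemented


class Tagged:
    """Hypothetical downstream type that wraps a MiniSeries."""

    def __init__(self, series, tag):
        self.series = series
        self.tag = tag

    def __radd__(self, other):
        if isinstance(other, MiniSeries):
            return Tagged(other + self.series, self.tag)
        return NotImplemented


# MiniSeries.__add__ returns NotImplemented for Tagged, so Python falls
# back to Tagged.__radd__, which knows how to combine the two.
result = MiniSeries([1, 2, 3]) + Tagged(MiniSeries([10, 20, 30]), "meters")
```

Had `MiniSeries.__add__` raised `TypeError` instead of returning `NotImplemented`, the reflected method on `Tagged` would never have been tried.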

***Bug fixes***
This PR improves the handling of NULL and NaN columns in a number of scenarios, for instance where a column is missing from one input or the other. Additionally, this PR fixes the behavior of RangeIndexes when they are multiplied by integers (pandas returns RangeIndex objects in that case). This fix is largely independent of the new code since it essentially intercepts scalar ints up front and reconstructs a new object.
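The "intercept scalar ints up front and reconstruct" idea can be sketched with Python's built-in `range` as a stand-in for a RangeIndex. The helper `scale_range` is hypothetical, and the fallback to a materialized list for other operands (including zero, since a range cannot have step 0) is an assumption of this sketch rather than a description of cudf's behavior.

```python
def scale_range(r: range, k):
    """Multiply every element of `r` by `k`, keeping a lazy range when possible."""
    # Intercept nonzero integer scalars up front: scaling start, stop, and
    # step by k describes exactly the elementwise products.
    if isinstance(k, int) and not isinstance(k, bool) and k != 0:
        return range(r.start * k, r.stop * k, r.step * k)
    # Anything else falls back to materializing the values.
    return [v * k for v in r]


doubled = scale_range(range(0, 5), 2)
negated = scale_range(range(1, 10, 3), -2)
fractional = scale_range(range(3), 1.5)
```

The integer path never touches the individual elements, which is why this kind of fix can sit in front of the general binop machinery without depending on it.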

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/brandon-b-miller

URL: #8598
vyasr authored Jul 15, 2021
1 parent ff905a8 commit a2d12b7
Showing 6 changed files with 420 additions and 417 deletions.
5 changes: 5 additions & 0 deletions python/cudf/cudf/api/types.py
@@ -126,6 +126,7 @@ def is_scalar(val):
    return (
        isinstance(val, DeviceScalar)
        or isinstance(val, cudf.Scalar)
        or isinstance(val, cudf.core.tools.datetimes.DateOffset)
        or pd_types.is_scalar(val)
    )

@@ -267,3 +268,7 @@ def _union_categoricals(
is_re = pd_types.is_re
is_re_compilable = pd_types.is_re_compilable
is_dtype_equal = pd_types.is_dtype_equal


# Aliases of numpy dtype functionality.
issubdtype = np.issubdtype