Rewrite binary operations for improved performance and additional typ… · rapidsai/cudf@a2d12b7

Rewrite binary operations for improved performance and additional type support (#8598)

This PR builds on #8166 to introduce a common code path for binary operations between all Frame types, not just Index and Series. The central component is a new internal API for operations between a sequence of pairs of columns (or column-like objects) that subclasses of Frame can call after performing any preprocessing necessary on the operands. The result is (in addition to the benefits of reducing total code and addressing some tech debt) substantial performance improvements for DataFrame binops, new features in the form of supporting binary operations between a much wider class of data types, and some minor bug fixes that were previously quite challenging to address. 

Resolves #4536.

***Performance***
By avoiding calls through to Series binary operations to perform binops between DataFrames (which requires lots of ColumnAccessor operations, construction of Series objects, and lots of indexing), we gain substantial improvements in DataFrame binops. I typically see at least a 20% improvement in binops between DataFrames, with significantly larger improvements (up to 60%) depending on data size, nullability, and other related factors. The performance impact of the new code paths for Series and Index binops is negligible, since the changes effectively amount to a loop over a single-element dictionary instead of operating on that element directly.
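
To illustrate the shape of a column-pair code path, here is a minimal pure-Python sketch. The function name `colwise_binop`, the use of plain lists as columns, and the null rule (a `None` on either side yields `None`) are all assumptions made for illustration; this is not the internal cudf API.

```python
import operator


def colwise_binop(operands, op):
    """Apply `op` to each (left, right) pair of columns.

    `operands` maps a column name to a (left, right) tuple of
    equal-length lists. Lists are a hypothetical stand-in for
    real columns; None marks a null value.
    """
    result = {}
    for name, (left, right) in operands.items():
        result[name] = [
            None if lval is None or rval is None else op(lval, rval)
            for lval, rval in zip(left, right)
        ]
    return result


# A DataFrame-like caller passes one pair per column ...
df_out = colwise_binop(
    {"a": ([1, 2], [10, 20]), "b": ([3, None], [30, 40])},
    operator.add,
)

# ... while a Series-like caller passes a single-element dictionary.
sr_out = colwise_binop({"x": ([1, 2], [10, 20])}, operator.add)
```

In this sketch the Series-like case pays only for one extra dictionary iteration, which is the kind of overhead the paragraph above describes as negligible.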

***New features***
The centralized binop code made it easy to insert an `as_column` at an appropriate point for the right operand to a binop, which enables support for a wide range of other objects. For instance, it is now possible to add numpy arrays or pandas Series objects to a cudf Series, which was not previously possible. This code also appropriately handles the return of `NotImplemented` for all data types, enabling reflected operations for other libraries that build on `cudf`.
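
The role of `NotImplemented` can be shown with a small pure-Python sketch that relies only on Python's standard operator protocol. The classes `Column` and `Wrapper` are hypothetical stand-ins and do not correspond to cudf types; the point is only that returning `NotImplemented` lets Python fall back to the right operand's reflected method.

```python
class Column:
    """Hypothetical stand-in for a cudf-like container."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        if isinstance(other, Column):
            other = other.values
        elif isinstance(other, (int, float)):
            other = [other] * len(self.values)
        elif not isinstance(other, (list, tuple)):
            # Unknown operand: defer to the other object's __radd__.
            return NotImplemented
        return Column(a + b for a, b in zip(self.values, other))


class Wrapper:
    """Hypothetical downstream type that builds on Column."""

    def __init__(self, column):
        self.column = column

    def __radd__(self, other):
        # Reached only because Column.__add__ returned NotImplemented.
        return Wrapper(other + self.column)


result = Column([1, 2, 3]) + Wrapper(Column([10, 20, 30]))
```

If `Column.__add__` raised an exception for unknown operands instead, `Wrapper.__radd__` would never be tried and the downstream library could not support the operation.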

***Bug fixes***
This PR improves the handling of NULL and NaN columns in a number of scenarios, for instance where a column is missing from one input or the other. Additionally, this PR fixes the behavior of RangeIndexes when they are multiplied by integers (pandas returns RangeIndex objects in that case). This fix is largely independent of the new code since it essentially intercepts scalar ints up front and reconstructs a new object.
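
The "intercept scalar ints up front" idea can be sketched in pure Python as follows. `SketchRangeIndex` is a hypothetical class built on Python's `range`, not cudf's `RangeIndex`, and the handling of zero and of non-integer scalars here is an assumption made to keep the example valid rather than a statement about cudf or pandas behavior.

```python
class SketchRangeIndex:
    """Hypothetical range-backed index; not cudf's RangeIndex."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)

    def __mul__(self, other):
        is_plain_int = isinstance(other, int) and not isinstance(other, bool)
        # Intercept nonzero ints up front and rebuild a range-backed object.
        # Zero is excluded because range() does not allow a step of 0.
        if is_plain_int and other != 0:
            r = self._range
            return SketchRangeIndex(
                r.start * other, r.stop * other, r.step * other
            )
        # Other numeric scalars fall through to a materialized result.
        if isinstance(other, (int, float)):
            return [value * other for value in self._range]
        return NotImplemented

    # Multiplication by a scalar is commutative, so reuse the same logic.
    __rmul__ = __mul__

    def to_list(self):
        return list(self._range)


doubled = SketchRangeIndex(0, 5) * 2
tripled = 3 * SketchRangeIndex(1, 10, 3)
```

Scaling `start`, `stop`, and `step` by the same integer keeps the result lazy, so no values need to be materialized for the common integer case.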

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/brandon-b-miller

URL: #8598

vyasr authored Jul 15, 2021

1 parent ff905a8 commit a2d12b7

python/cudf/cudf/api/types.py

@@ -126,6 +126,7 @@ def is_scalar(val):
         return (
             isinstance(val, DeviceScalar)
             or isinstance(val, cudf.Scalar)
+            or isinstance(val, cudf.core.tools.datetimes.DateOffset)
             or pd_types.is_scalar(val)
         )
@@ -267,3 +268,7 @@ def _union_categoricals(
     is_re = pd_types.is_re
     is_re_compilable = pd_types.is_re_compilable
     is_dtype_equal = pd_types.is_dtype_equal
+    # Aliases of numpy dtype functionality.
+    issubdtype = np.issubdtype
