Rewrite binary operations for improved performance and additional type support (#8598)

This PR builds on #8166 to introduce a common code path for binary operations between all Frame types, not just Index and Series. The central component is a new internal API for operations between a sequence of pairs of columns (or column-like objects) that subclasses of Frame can call after performing any preprocessing necessary on the operands. The result is (in addition to the benefits of reducing total code and addressing some tech debt) substantial performance improvements for DataFrame binops, new features in the form of supporting binary operations between a much wider class of data types, and some minor bug fixes that were previously quite challenging to address. Resolves #4536.

***Performance***

By avoiding calling through to Series binary operations to perform binops between DataFrames (which requires lots of ColumnAccessor operations, construction of Series objects, and lots of indexing), we gain substantial improvements in DataFrame binops. On average I see at minimum a 20% improvement in binops between DataFrames, with significantly larger improvements (up to 60%) depending on data size, nullability, and other related factors. The performance impact of the new code paths for Series and Index binops is negligible, since the changes effectively amount to a loop over a single element dictionary instead of just operating on that element directly.

***New features***

The centralized binop code made it easy to insert an `as_column` at an appropriate point for the right operand to a binop, which enables support for a wide range of other objects. For instance, it is now possible to add numpy arrays or pandas Series objects to a cudf Series, which was not previously possible. This code also appropriately handles the return of `NotImplemented` for all data types, enabling reflected operations for other libraries that build on `cudf`.
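
The shape of the common code path and the `NotImplemented` handling described above can be illustrated with a small pure-Python sketch. This is not the code in this PR: `MiniFrame`, `_colwise_binop`, `_binop`, and `Offset` are hypothetical names, columns are plain lists with `None` standing in for null, and filling a missing column with nulls is a simplifying assumption made here for illustration.

```python
import operator


class MiniFrame:
    """Hypothetical stand-in for a Frame: column name -> list of values."""

    def __init__(self, data):
        self._data = {name: list(col) for name, col in data.items()}

    def __len__(self):
        # Row count, taken from the first column (0 if there are no columns).
        return len(next(iter(self._data.values()), []))

    def to_dict(self):
        return {name: list(col) for name, col in self._data.items()}

    @staticmethod
    def _colwise_binop(operands, fn):
        # Shared path: operands maps name -> (left column, right column or scalar).
        result = {}
        for name, (left, right) in operands.items():
            if isinstance(right, (int, float)):
                right = [right] * len(left)  # broadcast a scalar operand
            result[name] = [
                None if a is None or b is None else fn(a, b)  # nulls propagate
                for a, b in zip(left, right)
            ]
        return result

    def _binop(self, other, fn):
        # Preprocessing only: build the column pairs, then call the shared path.
        if isinstance(other, MiniFrame):
            nulls = [None] * len(self)  # assumes both frames have equal row counts
            names = list(self._data) + [n for n in other._data if n not in self._data]
            operands = {
                n: (self._data.get(n, nulls), other._data.get(n, nulls)) for n in names
            }
        elif isinstance(other, (int, float)):
            operands = {n: (col, other) for n, col in self._data.items()}
        else:
            # Unknown operand type: let Python try the other object's reflected method.
            return NotImplemented
        return MiniFrame(self._colwise_binop(operands, fn))

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __mul__(self, other):
        return self._binop(other, operator.mul)


class Offset:
    """Hypothetical third-party type that defines its own reflected addition."""

    def __init__(self, value):
        self.value = value

    def __radd__(self, frame):
        return frame + self.value


left = MiniFrame({"a": [1, 2, None], "b": [4, 5, 6]})
right = MiniFrame({"a": [10, 20, 30], "c": [7, 8, 9]})
print((left + right).to_dict())
print((left + Offset(10)).to_dict())
```

Because `MiniFrame.__add__` returns `NotImplemented` for an `Offset`, Python falls back to `Offset.__radd__`, which is the mechanism that lets libraries built on top of a frame type supply their own reflected operations.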

***Bug fixes***

This PR improves the handling of NULL and NaN columns in a number of scenarios, for instance where a column is missing from one input or the other. Additionally, this PR fixes the behavior of RangeIndexes when they are multiplied by integers (pandas returns RangeIndex objects in that case). This fix is largely independent of the new code since it essentially intercepts scalar ints up front and reconstructs a new object.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- https://github.com/brandon-b-miller

URL: #8598