Merge Index and Series binops #8166
…aFrame." This reverts commit a49f3748c7d19fa39181df05549c7a7dcced4a9a.
…ently still a pass-through to Series).
Here's a snapshot of the performance improvements for doing binary operations. Each line should be compared to the line directly below it so that the only difference is whether it's
Codecov Report
@@ Coverage Diff @@
## branch-0.20 #8166 +/- ##
===============================================
- Coverage 82.88% 82.86% -0.03%
===============================================
Files 103 104 +1
Lines 17668 17884 +216
===============================================
+ Hits 14645 14820 +175
- Misses 3023 3064 +41
@gpucibot merge
Continuation of #8115 and #8166. Moves more logic out of the Index/Series classes into the new common parent class to reduce code duplication and ensure feature parity.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- https://github.com/brandon-b-miller
- Michael Wang (https://github.com/isVoid)

URL: #8253
…e support (#8598)

This PR builds on #8166 to introduce a common code path for binary operations between all Frame types, not just Index and Series. The central component is a new internal API for operations between a sequence of pairs of columns (or column-like objects) that subclasses of Frame can call after performing any preprocessing necessary on the operands. The result is (in addition to the benefits of reducing total code and addressing some tech debt) substantial performance improvements for DataFrame binops, new features in the form of supporting binary operations between a much wider class of data types, and some minor bug fixes that were previously quite challenging to address. Resolves #4536.

***Performance***

By avoiding calling through to Series binary operations to perform binops between DataFrames (which requires lots of ColumnAccessor operations, construction of Series objects, and lots of indexing), we gain substantial improvements in DataFrame binops. On average I see at minimum a 20% improvement in binops between DataFrames, with significantly larger improvements (up to 60%) depending on data size, nullability, and other related factors. The performance impact of the new code paths for Series and Index binops is negligible, since the changes effectively amount to a loop over a single element dictionary instead of just operating on that element directly.

***New features***

The centralized binop code made it easy to insert an `as_column` at an appropriate point for the right operand to a binop, which enables support for a wide range of other objects. For instance, it is now possible to add numpy arrays or pandas Series objects to a cudf Series, which was not previously possible. This code also appropriately handles the return of `NotImplemented` for all data types, enabling reflected operations for other libraries that build on `cudf`.
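The column-pair idea and the `NotImplemented` handling described above can be illustrated with a small pure-Python sketch. This is not cudf's implementation: `Frame`, `_colwise_binop`, `_binaryop`, and `Tagged` are hypothetical names chosen for illustration, columns are plain lists, and both operands are assumed to have the same column names.

```python
import operator


class Frame:
    """Toy stand-in for a frame: a dict of column name -> list (not cudf's Frame)."""

    def __init__(self, data):
        self._data = dict(data)

    @staticmethod
    def _colwise_binop(operands, fn):
        # operands maps each output name to a (left_column, right_operand) pair.
        out = {}
        for name, (left, right) in operands.items():
            if not isinstance(right, list):
                right = [right] * len(left)  # broadcast a scalar
            out[name] = [fn(a, b) for a, b in zip(left, right)]
        return out

    def _binaryop(self, other, fn):
        # Preprocess the operands, then hand the pairs to the shared routine.
        if isinstance(other, Frame):
            operands = {k: (v, other._data[k]) for k, v in self._data.items()}
        elif isinstance(other, (int, float)):
            operands = {k: (v, other) for k, v in self._data.items()}
        else:
            # Unknown type: let Python try the other operand's reflected method.
            return NotImplemented
        return type(self)(self._colwise_binop(operands, fn))

    def __add__(self, other):
        return self._binaryop(other, operator.add)


class Tagged:
    """Hypothetical third-party type that only implements the reflected op."""

    def __radd__(self, other):
        return ("reflected", other)
```

Because `Frame.__add__` returns `NotImplemented` for types it does not recognize, `Frame(...) + Tagged()` falls through to `Tagged.__radd__` instead of raising inside the frame code.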
***Bug fixes***

This PR improves the handling of NULL and NaN columns in a number of scenarios, for instance where a column is missing from one input or the other. Additionally, this PR fixes the behavior of RangeIndexes when they are multiplied by integers (pandas returns RangeIndex objects in that case). This fix is largely independent of the new code since it essentially intercepts scalar ints up front and reconstructs a new object.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- https://github.com/brandon-b-miller

URL: #8598
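The RangeIndex fix mentioned above (intercept scalar ints up front and rebuild a range rather than materializing values) might look roughly like the following. This is a sketch on top of Python's built-in `range`; `RangeIndexSketch` and its methods are hypothetical and not cudf or pandas API, and the treatment of zero and non-integer scalars is an assumption made for this example.

```python
class RangeIndexSketch:
    """Toy range-backed index (hypothetical; not the cudf RangeIndex)."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)

    def values(self):
        return list(self._range)

    def __mul__(self, other):
        # Intercept nonzero integer scalars: scaling start, stop and step
        # gives the same elements without materializing them.
        is_int = isinstance(other, int) and not isinstance(other, bool)
        if is_int and other != 0:
            r = self._range
            return RangeIndexSketch(r.start * other, r.stop * other, r.step * other)
        # Anything else falls back to a materialized result (assumption).
        return [v * other for v in self._range]

    __rmul__ = __mul__
```

Scaling works for negative multipliers too, since the sign of the step flips together with the bounds; zero is excluded because a `range` cannot have a step of zero.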
This PR builds on #8115, moving all binary operations from `Index` and `Series` into the `SingleColumnFrame` class. It really should be a negative LOC change, but it doesn't look like it for two reasons: 1) `Index` objects require some special handling due to the awkwardness of needing to return the right type of `Index`, which is frequently not the type that is being operated on (e.g. `RangeIndex + RangeIndex` results in an `Int64Index`), and that's something we'll want to refactor in a future PR, and 2) I've added a significant number of comments both in the form of docstrings and to give context for the issues arising from (1).

This PR also significantly speeds up all binary operations for `Index` objects because it removes the round-tripping of data from `Index->Series->Index` that was previously being done to implement binary operations. The percent speedup depends on how expensive the operation itself is, but having tested for a number of data sizes it is >=15%, ranging up to 40% for simpler operations like `__ne__`. Benchmarks to follow.
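As a rough illustration of the structure described here, the sketch below puts the binop logic in one parent class and lets an index subclass choose its own result type. These are toy classes that only borrow the names from the description; `_binaryop` and `_wrap_result` are hypothetical hooks, and columns are plain Python lists rather than device columns.

```python
import operator


class SingleColumnFrame:
    """Toy common parent holding one column (not the cudf class)."""

    def __init__(self, values):
        self._column = list(values)

    def _binaryop(self, other, fn):
        # Operate on the column directly; no Index -> Series -> Index round trip.
        if isinstance(other, SingleColumnFrame):
            rhs = other._column
        else:
            rhs = [other] * len(self._column)  # broadcast a scalar
        return self._wrap_result([fn(a, b) for a, b in zip(self._column, rhs)])

    def _wrap_result(self, values):
        # Default: return the same type that was operated on.
        return type(self)(values)

    def __add__(self, other):
        return self._binaryop(other, operator.add)

    def __ne__(self, other):
        return self._binaryop(other, operator.ne)


class Series(SingleColumnFrame):
    pass


class Int64Index(SingleColumnFrame):
    pass


class RangeIndex(SingleColumnFrame):
    def __init__(self, start, stop):
        super().__init__(range(start, stop))

    def _wrap_result(self, values):
        # The result is generally not a range, so return a different index type.
        return Int64Index(values)
```

With this layout `Series` inherits every operator unchanged, while the special handling for index return types is confined to a single overridable method.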