Move the binary_ops common dispatcher logic to be executed on the CPU #9816
Conversation
Performance comparison:
Binary size change is an increase of 80KB (from …).
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9816 +/- ##
=============================================
Coverage 10.60% 10.60%
=============================================
Files 118 118
Lines 20081 20081
=============================================
Hits 2130 2130
Misses 17951 17951
Continue to review the full report at Codecov.
Awesome. Surprised that the binary size didn't increase much! The following operations still take longer for size 100M.
JIT and the 21.08 compiled binops are in the 3ms range for size 100M. We need to add benchmarks for both the common_type combination and the non-common-type combination so that both kernels are benchmarked now.
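For reference, a minimal sketch of what such a pair of benchmarks could look like, assuming Google Benchmark and the public cudf::binary_operation, cudf::make_numeric_column, and cudf::make_timestamp_column APIs; the actual cuDF benchmark fixtures, and the exact non-common-type combination chosen, may differ:

```cpp
// Hypothetical sketch, not the actual cuDF benchmark code. Columns are left
// uninitialized because only the kernel time matters here.
#include <benchmark/benchmark.h>

#include <cudf/binaryop.hpp>
#include <cudf/column/column_factories.hpp>
#include <cudf/types.hpp>

#include <cuda_runtime.h>

namespace {

// Common-type combination: INT32 + INT32 -> INT32 (a std::common_type exists).
void BM_binaryop_common_type(benchmark::State& state)
{
  auto const n   = static_cast<cudf::size_type>(state.range(0));
  auto const lhs = cudf::make_numeric_column(cudf::data_type{cudf::type_id::INT32}, n);
  auto const rhs = cudf::make_numeric_column(cudf::data_type{cudf::type_id::INT32}, n);
  for (auto _ : state) {
    auto result = cudf::binary_operation(lhs->view(), rhs->view(), cudf::binary_operator::ADD,
                                         cudf::data_type{cudf::type_id::INT32});
    cudaDeviceSynchronize();  // measure the device work, not just the kernel launch
  }
}

// Non-common-type combination: timestamp - timestamp -> duration is assumed here as a
// (out, lhs, rhs) triple without a std::common_type, so it should exercise the other kernel.
void BM_binaryop_no_common_type(benchmark::State& state)
{
  auto const n   = static_cast<cudf::size_type>(state.range(0));
  auto const lhs = cudf::make_timestamp_column(cudf::data_type{cudf::type_id::TIMESTAMP_DAYS}, n);
  auto const rhs = cudf::make_timestamp_column(cudf::data_type{cudf::type_id::TIMESTAMP_DAYS}, n);
  for (auto _ : state) {
    auto result = cudf::binary_operation(lhs->view(), rhs->view(), cudf::binary_operator::SUB,
                                         cudf::data_type{cudf::type_id::DURATION_DAYS});
    cudaDeviceSynchronize();
  }
}

}  // namespace

BENCHMARK(BM_binaryop_common_type)->Arg(100'000'000)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_binaryop_no_common_type)->Arg(100'000'000)->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
```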
bool is_rhs_scalar,
binary_operator op,
rmm::cuda_stream_view stream)
__forceinline__ void operator_dispatcher(mutable_column_device_view& out,
Have you tested with inline instead of force inline? We should prefer standard keywords wherever possible.
I specifically requested this as the regression is caused by a lack of inlining due to crossing a threshold of complexity.
Actually, this is the wrong place to put it because this is host code.
I have moved the __forceinline__ to the correct place and am re-running the benchmarks.
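For illustration, here is a minimal, self-contained CUDA sketch (hypothetical code, not the cuDF sources) of the placement discussed above: the per-element device function is the one that benefits from forced inlining, while the host-side dispatcher stays plain host code.

```cpp
// Standalone .cu example (not cuDF code), compiled with nvcc.
#include <cuda_runtime.h>
#include <cstdio>

// Device-side element operation: forcing inlining here avoids per-element call
// overhead once the surrounding kernel grows complex enough that the compiler
// stops inlining on its own.
__forceinline__ __device__ float add_op(float a, float b) { return a + b; }

__global__ void binary_kernel(float* out, float const* lhs, float const* rhs, int n)
{
  int const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = add_op(lhs[i], rhs[i]); }
}

// Host-side dispatcher: this is host code, so (per the review above) forcing inlining
// here is not what fixes the device-side regression; plain inline suffices for a
// header-style definition.
inline void dispatch_add(float* out, float const* lhs, float const* rhs, int n,
                         cudaStream_t stream)
{
  int const block = 256;
  int const grid  = (n + block - 1) / block;
  binary_kernel<<<grid, block, 0, stream>>>(out, lhs, rhs, n);
}

int main()
{
  int const n = 1 << 20;
  float *lhs{}, *rhs{}, *out{};
  cudaMalloc(&lhs, n * sizeof(float));
  cudaMalloc(&rhs, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  dispatch_add(out, lhs, rhs, n, nullptr);  // null stream = default stream
  cudaDeviceSynchronize();
  std::printf("done\n");
  cudaFree(lhs);
  cudaFree(rhs);
  cudaFree(out);
  return 0;
}
```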
@abellina and others ran through a large set of performance tests for the Spark plugin yesterday and last night. This patch improves performance quite a bit versus the 21.10 release: we are seeing a median improvement of about 7.4% across the queries we ran. That is a great win, because the same tests run against #9802 showed only about a 0.8% improvement. The tests we ran last night mostly targeted the non-decimal use case, as we were concerned about performance regressions there. We will likely be doing more performance tests today, but overall this is looking really good.
Compile time difference: only +4s! 👍 old: touch binary_op.cuh …
Updated performance numbers comparing this PR with and without forceinline. In short, we see the performance regression for … . For file size, we see a reduction of the binary by … .
Benchmark Details
LGTM 👍
Will open a separate PR for the templated benchmark to make things easier.
Fixes #9813
This builds on the work of #9802 and addresses #9813.
In short, we see significantly better performance for IMBALANCED, ADD, SUB, MUL, and DIV, with no significant change in the other binary_ops.
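As a rough illustration of what running the common dispatcher logic on the CPU means, here is a hedged, standalone CUDA sketch (hypothetical names, not the cuDF implementation): the operator is selected once on the host, so the launched kernel contains only the single element-wise operation rather than re-evaluating the dispatch per thread.

```cpp
#include <cuda_runtime.h>

enum class binary_op { ADD, SUB, MUL, DIV };

// Element-wise functors; each kernel instantiation carries exactly one of them.
struct add_fn { __device__ float operator()(float a, float b) const { return a + b; } };
struct sub_fn { __device__ float operator()(float a, float b) const { return a - b; } };
struct mul_fn { __device__ float operator()(float a, float b) const { return a * b; } };
struct div_fn { __device__ float operator()(float a, float b) const { return a / b; } };

template <typename F>
__global__ void apply_kernel(float* out, float const* lhs, float const* rhs, int n, F f)
{
  int const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = f(lhs[i], rhs[i]); }
}

// Host-side dispatcher: the switch over the operator runs once on the CPU, so no GPU
// thread has to re-evaluate the operator selection inside the kernel.
void dispatch_binary_op(float* out, float const* lhs, float const* rhs, int n,
                        binary_op op, cudaStream_t stream)
{
  int const block = 256;
  int const grid  = (n + block - 1) / block;
  switch (op) {
    case binary_op::ADD: apply_kernel<<<grid, block, 0, stream>>>(out, lhs, rhs, n, add_fn{}); break;
    case binary_op::SUB: apply_kernel<<<grid, block, 0, stream>>>(out, lhs, rhs, n, sub_fn{}); break;
    case binary_op::MUL: apply_kernel<<<grid, block, 0, stream>>>(out, lhs, rhs, n, mul_fn{}); break;
    case binary_op::DIV: apply_kernel<<<grid, block, 0, stream>>>(out, lhs, rhs, n, div_fn{}); break;
  }
}

int main()
{
  int const n = 1 << 20;
  float *lhs{}, *rhs{}, *out{};
  cudaMalloc(&lhs, n * sizeof(float));
  cudaMalloc(&rhs, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  dispatch_binary_op(out, lhs, rhs, n, binary_op::ADD, nullptr);
  cudaDeviceSynchronize();
  cudaFree(lhs);
  cudaFree(rhs);
  cudaFree(out);
  return 0;
}
```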