Fix performance bugs in scalar reductions #509
Conversation
* Use unsigned 64-bit integers instead of signed integers wherever possible; CUDA hasn't added an atomic intrinsic for the latter yet.
* Move reduction buffers from zero-copy memory to framebuffer. This makes the slow atomic update code path in reduction operators run much more efficiently.
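For context on the first bullet: CUDA exposes a native atomicAdd overload for unsigned long long but has no signed 64-bit counterpart, so signed accumulators end up on a slower CAS-based path. A minimal sketch of the pattern, using a hypothetical counting kernel that is not code from this PR:

#include <cstddef>

// Hypothetical kernel: counts non-zero elements into a 64-bit accumulator.
// The accumulator is unsigned long long so the native 64-bit atomicAdd
// intrinsic can be used instead of a compare-and-swap emulation loop.
__global__ void count_nonzero(const double* in, size_t volume, unsigned long long* total)
{
  const size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= volume) return;
  if (in[idx] != 0.0) atomicAdd(total, 1ULL);
}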
Do you want to also replace the use of
CHECK_CUDA(cudaMemcpyAsync(ptr_, &identity, sizeof(LHS), cudaMemcpyHostToDevice, stream));
}

__device__ void operator<<=(const RHS& value) const
Nit: obviously these things come down to preference and this is just matching Legion, but I would personally suggest writing this out as a function name rather than overloading an operator. This appears to be doing an atomic reduce. The reduce_output helper function was a little difficult to parse with the <<= (a bit-shift operator borrowed for a different purpose) instead of a function call that says exactly what the code is doing (e.g. result.non_exclusive_fold).
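A rough sketch of what that suggestion could look like; the class and method names are hypothetical and the fold call follows the usual Legion reduction-operator convention, so this is not necessarily the code that ultimately landed:

template <typename REDOP>
class DeviceScalarReductionBuffer {
 public:
  using RHS = typename REDOP::RHS;

  // Non-exclusive (atomic) fold: safe to call concurrently from many threads.
  __device__ void non_exclusive_fold(const RHS& value) const
  {
    REDOP::template fold<false /*exclusive*/>(*ptr_, value);
  }

 private:
  RHS* ptr_{nullptr};
};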
using RHS = typename REDOP::RHS;

public:
ScalarReductionBuffer(cudaStream_t stream) : buffer_(legate::create_buffer<LHS>(1)) |
The class name obviously gets annoyingly long, but consider calling this 'DeviceScalarReductionBuffer' to make it clear that this is not a general reduction buffer and is only designed for device reductions.
using RHS = typename REDOP::RHS;

public:
ScalarReductionBuffer(cudaStream_t stream) : buffer_(legate::create_buffer<LHS>(1)) |
Since this is again only going to run on the device, do we want to explicitly pass GPU_FB_MEM to create_buffer to make it clearer what is happening? Otherwise this uses the default kind = NO_MEMKIND, which seems potentially fragile: it relies on get_executing_processor() returning TOC_PROC to allocate this in the right place.
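A sketch of that suggestion, assuming legate::create_buffer accepts an optional memory-kind argument (which the remark about the default kind = NO_MEMKIND implies) and that the Legion enum is spelled Memory::Kind::GPU_FB_MEM; the constructor body would stay as shown in the diff context above:

ScalarReductionBuffer(cudaStream_t stream)
  // Request framebuffer memory explicitly instead of relying on the
  // executing processor being a TOC_PROC.
  : buffer_(legate::create_buffer<LHS>(1, Memory::Kind::GPU_FB_MEM))
{
  // Body unchanged: copy the reduction identity into the buffer with
  // cudaMemcpyAsync on the provided stream, as in the diff context above.
}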
* Unify the template for device reduction tree and do some cleanup
* Fix performance bugs in scalar reduction kernels:
  * Use unsigned 64-bit integers instead of signed integers wherever possible; CUDA hasn't added an atomic intrinsic for the latter yet.
  * Move reduction buffers from zero-copy memory to framebuffer. This makes the slow atomic update code path in reduction operators run much more efficiently.
* Use the new scalar reduction buffer in binary reductions as well
* Use only the RHS type in the reduction buffer as we never call apply
* Minor cleanup per review
* Rename the buffer class and method to make the intent explicit
* Flip the polarity of reduce's template parameter

Co-authored-by: Wonchan Lee <[email protected]>