Fix performance bugs in scalar reductions (#509) #543

* Unify the template for device reduction tree and do some cleanup
* Fix performance bugs in scalar reduction kernels:
  * Use unsigned 64-bit integers instead of signed integers wherever possible; CUDA hasn't added an atomic intrinsic for the latter yet.
  * Move reduction buffers from zero-copy memory to framebuffer. This makes the slow atomic update code path in reduction operators run much more efficiently.
* Use the new scalar reduction buffer in binary reductions as well
* Use only the RHS type in the reduction buffer as we never call apply
* Minor cleanup per review
* Rename the buffer class and method to make the intent explicit
* Flip the polarity of reduce's template parameter
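The signed-versus-unsigned point can be illustrated on the host. The following is a minimal C++ sketch, not code from this PR: it assumes the reduction is a sum and shows that routing signed 64-bit values through an unsigned 64-bit atomic add yields the same result, because unsigned addition wraps modulo 2^64 exactly like two's-complement addition. The helper names `atomic_add_signed` and `load_signed` are hypothetical, and `std::atomic` stands in for the device-side atomic.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): add a signed 64-bit value to an
// accumulator that is stored as an unsigned 64-bit atomic. Converting a
// negative value to uint64_t is well defined (reduction modulo 2^64).
inline void atomic_add_signed(std::atomic<std::uint64_t>& acc, std::int64_t val) {
  acc.fetch_add(static_cast<std::uint64_t>(val), std::memory_order_relaxed);
}

// Hypothetical helper: read the accumulator back as a signed value by
// copying the bit pattern, which avoids any implementation-defined cast.
inline std::int64_t load_signed(const std::atomic<std::uint64_t>& acc) {
  const std::uint64_t bits = acc.load(std::memory_order_relaxed);
  std::int64_t out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}
```

On the device, the analogous move would be to call the unsigned 64-bit atomic on the same storage; whether the PR does exactly this is an assumption, as the commit message only says unsigned integers are used wherever possible.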


Commits on Aug 17, 2022