Coalesce stores in Slice for smaller output types #3568
This change coalesces stores to global memory in SliceGPU when OutputType is smaller than 4 bytes in order to improve performance. Signed-off-by: Szymon Karpiński <[email protected]>
dali/kernels/slice/slice_gpu.cuh
Outdated
@@ -72,6 +72,9 @@ struct SliceBlockDesc {
  uint64_t size;
};

template<typename OutputType>
constexpr int coalesced_pixels = sizeof(OutputType) >= 4 ? 1 : 4 / sizeof(OutputType);
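To make the constant in the hunk above concrete, here is a small standalone C++14 sketch that evaluates it for a few element types. The variable template is copied from the diff; the chosen types and the `static_assert` checks are illustrative assumptions, not part of the PR.

```cpp
#include <cstdint>

// Copied from the diff: number of OutputType elements that fit in a 4-byte word
// (1 for types that are already 4 bytes or larger).
template <typename OutputType>
constexpr int coalesced_pixels = sizeof(OutputType) >= 4 ? 1 : 4 / sizeof(OutputType);

// Illustrative checks (not part of the PR).
static_assert(coalesced_pixels<std::uint8_t> == 4, "four 1-byte values per 32-bit word");
static_assert(coalesced_pixels<std::int16_t> == 2, "two 2-byte values per 32-bit word");
static_assert(coalesced_pixels<std::int32_t> == 1, "4-byte types are left unchanged");
static_assert(coalesced_pixels<std::int64_t> == 1, "larger types are left unchanged");
```

Because the value is a compile-time constant, a loop over it has a fixed trip count that the compiler can unroll.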
Nitpick: perhaps it should be "values", not "pixels". An RGB pixel consists of 3 values and I don't think this is recognized anyhow.
You're right, this is a bit confusing. Renamed to "values"
Very good work!
You can consider the naming nitpick, but otherwise it looks good.
!build
CI MESSAGE: [3588408]: BUILD FAILED
CI MESSAGE: [3588408]: BUILD STARTED
Signed-off-by: Szymon Karpiński <[email protected]>
Clang fails to unroll a loop (the outer loop, I guess). There are two options:
Signed-off-by: Szymon Karpiński <[email protected]>
8fb1a8e to 58ea73d
CI MESSAGE: [3589111]: BUILD STARTED
CI MESSAGE: [3589111]: BUILD PASSED
* Coalesce stores in Slice for smaller output types This change coalesces stores to global memory in SliceGPU when OutputType is smaller than 4 bytes in order to improve performance. Signed-off-by: Szymon Karpiński <[email protected]>
Description
What happened in this PR
Note: The diff presented by GitHub looks terrible and huge. Use the "Hide whitespace" option to see the real changes rather than the indentation change ;)
Currently, each thread in a block of the `SliceGPU` kernel is responsible for processing and storing pixels at indices spaced by `BlockDim.x`. When `OutputType` is small, for example `uint8`, this results in single-byte stores from each thread.
This proposal makes each thread process a few consecutive pixels, so that the stores can be easily coalesced into a full 32-bit global memory access.
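
The indexing scheme proposed here can be emulated on the host to see which thread writes which output element. The following is a minimal C++ sketch and not the actual `SliceGPU` kernel. The names `kValuesPerThread` and `EmulateBlockCopy` are hypothetical, the copy is a plain 1D copy without slicing or padding, and the loop over `tid` stands in for threads that would run in parallel on the GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical name: number of OutputType elements that fit in one 32-bit word.
template <typename OutputType>
constexpr int kValuesPerThread = sizeof(OutputType) >= 4 ? 1 : 4 / sizeof(OutputType);

// Emulates a single thread block copying `size` elements from `in` to `out`.
// Returns, for every output index, the id of the emulated thread that wrote it.
template <typename OutputType>
std::vector<int> EmulateBlockCopy(const OutputType* in, OutputType* out,
                                  std::size_t size, int block_dim) {
  constexpr int n = kValuesPerThread<OutputType>;
  std::vector<int> writer(size, -1);
  for (int tid = 0; tid < block_dim; ++tid) {  // parallel threads on a real GPU
    // Each thread starts at tid * n and advances by block_dim * n per step.
    for (std::size_t base = static_cast<std::size_t>(tid) * n; base < size;
         base += static_cast<std::size_t>(block_dim) * n) {
      // Fixed trip count: n consecutive elements, i.e. up to 4 adjacent bytes.
      for (int i = 0; i < n; ++i) {
        const std::size_t idx = base + i;
        if (idx >= size) break;  // tail of the buffer
        out[idx] = in[idx];
        writer[idx] = tid;
      }
    }
  }
  return writer;
}
```

With `uint8_t` data and `block_dim = 2`, emulated thread 0 writes indices 0 to 3, thread 1 writes indices 4 to 7, and thread 0 continues at index 8, so each thread's writes in a step cover one aligned 4-byte span. Whether the hardware actually merges them into a single store is not something this host emulation can show.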
This works as follows. The number of `OutputType` values that fit in a 32-bit word is computed at compile time as `coalesced_pixels`. Then, on each step, a thread processes `coalesced_pixels` pixels instead of one and moves right by `BlockDim.x * coalesced_pixels` indices instead of `BlockDim.x`. Those stores to consecutive locations (should) get coalesced into a single request by the GPU, reducing the total number of global memory requests issued.
This significantly improves Slice's performance for arrays of `uint8`s. The plots below present the throughput improvement in cropping 2000x2000 RGB `uint8` images to 1000x1000 on Titan V. As you can see, the throughput improvement is up to 40%.
Additional information
`OutputType` is smaller than 4 bytes.
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A