Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coalesce stores in Slice for smaller output types #3568

Merged
merged 3 commits into from
Dec 14, 2021

Conversation

szkarpinski
Copy link
Collaborator

@szkarpinski szkarpinski commented Dec 13, 2021

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

Note: The diff presented by GitHub looks terrible and huge. Use Hide whitespace option to see the real changes, and not the indent change ;)

Currently, each thread in a block of SliceGPU kernel is responsible for processing and storing pixels at indices spaced by BlockDim.x. When OutputType is small, for example is uint8, this results in single-byte stores from each thread.

This proposal makes each thread process few subsequent pixels, so that the stores can be easily coalesced to a full 32-bit global memory access.

This works as follows. A number of OutputTypes fitting in a 32-bit word is computed (in compile time) as coalesced_pixels. Then, on each step, a thread processess coalesced_pixels pixels instead of one, and then moves right by BlockDim.x * coalesced_pixels indices instead of BlockDim.x. Those stores to subsequent locations (should) get coealesced to a single request by the GPU, thus reducing total number of global memory requests issued.

This significantly improves Slice's performance for arrays of uint8s. The plots below present the throughput improvement in cropping 2000x2000 RGB uint8 images to 1000x1000 on Titan V. As you can see, the throughput improvement is up to 40%.

image

Additional information

  • Affected modules and functionalities: SliceGPU kernel's performance when OutputType is smaller than 4 bytes.

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
@szkarpinski szkarpinski marked this pull request as ready for review December 13, 2021 22:54
@szkarpinski szkarpinski marked this pull request as draft December 13, 2021 22:59
@JanuszL JanuszL marked this pull request as ready for review December 14, 2021 09:11
@mzient mzient assigned mzient and jantonguirao and unassigned stiepan Dec 14, 2021
@mzient mzient assigned mzient and unassigned banasraf Dec 14, 2021
@@ -72,6 +72,9 @@ struct SliceBlockDesc {
uint64_t size;
};

template<typename OutputType>
constexpr int coalesced_pixels = sizeof(OutputType) >= 4 ? 1 : 4 / sizeof(OutputType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: perhaps it should be "values", not "pixels". An RGB pixel consists of 3 values and I don't think this is recognized anyhow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this is a bit confusing. Renamed to "values"

Copy link
Contributor

@mzient mzient left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work!
You can consider the naming nitpick, but otherwise it looks good.

@jantonguirao
Copy link
Contributor

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3588408]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3588408]: BUILD STARTED

@mzient
Copy link
Contributor

mzient commented Dec 14, 2021

Clang fails to unroll a loop (the outer loop, I guess). There are two options:

  1. Disable the warning about failed unrolling in clang-only build (-Wpass-failed=transform-warning)
  2. Put the outer #pragma unroll in a conditional compilation block and skip it in clang-only builds.

Signed-off-by: Szymon Karpiński <[email protected]>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3589111]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [3589111]: BUILD PASSED

@mzient mzient merged commit 740d626 into NVIDIA:main Dec 14, 2021
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Feb 21, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
@JanuszL JanuszL mentioned this pull request Mar 30, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
* Coalesce stores in Slice for smaller output types

This change coalesces stores to global memory in SliceGPU when
OutputType is smaller than 4 bytes in order to improve performance.

Signed-off-by: Szymon Karpiński <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants