Unnecessary barrier + possibly incorrect barriers? #1097
This is the intra-compute barrier function I've done in my port:

```cpp
void computeToComputeBarrier(const vk::CommandBuffer& cmdBuf, bool reverse) {
    std::array<vk::BufferMemoryBarrier2, 2> barriers;
    barriers[0] = vk::BufferMemoryBarrier2{
        // src stage and access
        vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderWrite,
        // dst stage and access
        vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderRead,
        // src and dst queue family indices (no ownership transfer)
        VK_QUEUE_FAMILY_IGNORED, VK_QUEUE_FAMILY_IGNORED,
        // buffer, offset and size
        reverse ? storageBuffers.input.buffer : storageBuffers.output.buffer, 0, VK_WHOLE_SIZE
    };
    barriers[1] = vk::BufferMemoryBarrier2{
        // src stage and access
        vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderRead,
        // dst stage and access
        vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderWrite,
        // src and dst queue family indices (no ownership transfer)
        VK_QUEUE_FAMILY_IGNORED, VK_QUEUE_FAMILY_IGNORED,
        // buffer, offset and size
        reverse ? storageBuffers.output.buffer : storageBuffers.input.buffer, 0, VK_WHOLE_SIZE
    };
    cmdBuf.pipelineBarrier2(vk::DependencyInfo{ {}, nullptr, barriers, nullptr });
}
```

I'm able to run this without any validation errors, though I'm not getting validation errors on your version either, so I guess the layers can't detect something subtle like this.
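For context, this is roughly how I drive it from the command-buffer build loop. A sketch only, not the exact code from my port: `pipelineLayout`, `computeDescriptorSets`, and `workGroupCount` are placeholder member names, and the point is just that alternating `reverse` each pass keeps the barrier directions in step with the ping-ponged buffers.

```cpp
// Sketch: assumes two descriptor sets that ping-pong the input/output bindings,
// plus member variables pipelineLayout, computeDescriptorSets and workGroupCount.
void recordComputePasses(const vk::CommandBuffer& cmdBuf) {
    constexpr uint32_t iterations = 64; // matches the example's compute loop
    for (uint32_t i = 0; i < iterations; ++i) {
        const bool reverse = (i % 2) != 0;
        // Set [0] reads input / writes output, set [1] reads output / writes input.
        cmdBuf.bindDescriptorSets(vk::PipelineBindPoint::eCompute, pipelineLayout, 0,
                                  computeDescriptorSets[reverse ? 1 : 0], nullptr);
        cmdBuf.dispatch(workGroupCount.x, workGroupCount.y, 1);
        // Make this pass's writes visible before the next pass reads them.
        computeToComputeBarrier(cmdBuf, reverse);
    }
}
```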
Sorry, I didn't realize you already had a global "rework sync" issue that would cover this: #871
I may have been the last one to touch this sync code, so I should probably try to answer your comments above:
I think you may be right about this for the compute-to-graphics transition.
The only compute queue memory barriers that matter are the write-to-read barriers. These are needed to ensure the previous pass's write operations have completed before reads happen in the next pass (when the buffers flip from write mode to read mode). That's why you only see write-to-read barriers in the code. Read-to-write transitions do not require memory barriers; they only need execution dependencies. See https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples-(Legacy-synchronization-APIs) for more info.

I will have another look at the barriers, including the cross-queue ones, to see if any need correction or simplification (i.e. make sure the masks are correct and get rid of unneeded barriers for buffers not involved in transitions).
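To illustrate that last point (my wording, not something from the repository, and `commandBuffer` is just a placeholder handle): a write-after-read hazard can be handled with a pure execution dependency, i.e. a pipeline barrier that carries no memory barriers at all, whereas the read-after-write case is the one that needs the access masks.

```cpp
// Write-after-read between two compute dispatches: an execution-only dependency is enough.
// No VkMemoryBarrier / VkBufferMemoryBarrier is needed, because nothing has to be made
// available or visible; the next dispatch's writes just have to start after the reads.
vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // wait for the previous dispatch (the reads)
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // before the next dispatch (the writes) starts
    0,                                      // no dependency flags
    0, nullptr,                             // no global memory barriers
    0, nullptr,                             // no buffer memory barriers
    0, nullptr);                            // no image memory barriers
```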
After studying the existing code, the compute-to-compute buffer memory barriers are functional as-is but could use some optimization based on your analysis above. The write-to-read barrier is correct, but it only needs to be applied to one of the buffers during each call, i.e. the one being written to. There is no need for a memory barrier on the other buffer, which is only being read from, since the same barrier's execution dependency already covers the read-to-write hazard.

The split queue transfer operations use access masks and pipeline stages for the graphics and compute queues. For the graphics -> compute queue transfer I think it's better to use VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT instead of VK_PIPELINE_STAGE_VERTEX_INPUT_BIT, and also use VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT for the access mask. Regarding the "extra" queue ownership transitions for …

Thanks @jherico for pointing out these issues. I will submit a PR that addresses them and in addition adds double buffering for the compute queue for parallel operation with the graphics queue. I always wondered why there were 2 compute command buffers defined but only 1 was used; I suspect the double buffering code was never completed. I have completed this with a small number of changes, primarily to the semaphores. The results are quite interesting and speed up the fps for this example by ~50% on Windows and >100% on Linux. There is no change on macOS, since there is no dedicated compute queue on that platform - it's shared with the graphics queue.

Here are some RenderDoc screen caps so you can see what is happening with the compute buffers in the compute-to-compute transitions (using my pending PR). The purple triangles are the memory barriers, and they follow the yellow write operations. The read operations are also marked in yellow, but they are not followed by memory barriers. You can see how the memory barriers are interleaved for the two storage buffers.
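A rough sketch of that first simplification, based on the helper posted earlier in this thread rather than on the pending PR: the compute-to-compute barrier only carries a single buffer memory barrier, for the buffer the previous dispatch wrote.

```cpp
// Sketch: same signature and member names as the helper above; only the just-written
// buffer gets a write-to-read memory barrier. The execution dependency this creates
// already orders the reads of the other buffer before it is written again.
void computeToComputeBarrier(const vk::CommandBuffer& cmdBuf, bool reverse) {
    vk::BufferMemoryBarrier2 barrier{
        // src stage and access: the dispatch that wrote the buffer
        vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderWrite,
        // dst stage and access: the dispatch that will read it
        vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderRead,
        // no queue family ownership transfer
        VK_QUEUE_FAMILY_IGNORED, VK_QUEUE_FAMILY_IGNORED,
        // only the buffer written in this pass
        reverse ? storageBuffers.input.buffer : storageBuffers.output.buffer, 0, VK_WHOLE_SIZE
    };
    cmdBuf.pipelineBarrier2(vk::DependencyInfo{ {}, nullptr, barrier, nullptr });
}
```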
My experience with this is that when I corrected the barriers, the behavior changed drastically; specifically, the amount of computed movement per frame increased. I suspect the current barrier setup may be sufficiently correct (or just too complicated) to satisfy the validation layers but is causing the computations in the 64-pass compute loop to be overwritten frequently. I'll try to grab a local RenderDoc capture and see if I can verify the issue.
There are now two pull requests for this. Which one should I look into?
You're moving both storage buffers between the compute and graphics queues and vice versa, but only storageBuffers.output.buffer is actually used on the graphics queue (see Vulkan/examples/computecloth/computecloth.cpp, line 188 in 44ff7a1).
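For reference, an ownership transfer limited to that one buffer could look roughly like the following. This is only a sketch, not the repository's code: computeQueueFamilyIndex, graphicsQueueFamilyIndex, and the two command buffers are assumed members, and the source access mask assumes the compute queue's last access to the buffer was a shader write.

```cpp
// Release ownership on the compute queue after the last dispatch that wrote the buffer.
VkBufferMemoryBarrier release{};
release.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
release.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
release.dstAccessMask = 0; // ignored for the release half of the transfer
release.srcQueueFamilyIndex = computeQueueFamilyIndex;
release.dstQueueFamilyIndex = graphicsQueueFamilyIndex;
release.buffer = storageBuffers.output.buffer;
release.offset = 0;
release.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(computeCommandBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    0, 0, nullptr, 1, &release, 0, nullptr);

// Acquire ownership on the graphics queue before the buffer is read as a vertex buffer.
VkBufferMemoryBarrier acquire = release;
acquire.srcAccessMask = 0; // ignored for the acquire half of the transfer
acquire.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
vkCmdPipelineBarrier(graphicsCommandBuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
    0, 0, nullptr, 1, &acquire, 0, nullptr);
```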
Looking more closely at this, I think the logic in addComputeToComputeBarriers is wrong. The loop in buildComputeCommandBuffer executes 64 times, using the descriptor sets to swap which buffer is being read and which one is being written, but addComputeToComputeBarriers only ever does one thing: it ensures that both the input and output buffers are going from a write state to a read state. One barrier should be going from read to write and the other from write to read, and on the next iteration they should swap back, over and over until the loop is exited, at which point the cross-queue barriers come into play.
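A sketch of what that could look like. The signature is hypothetical (judging by the description above, the existing helper takes no per-iteration argument), and the member names are assumed; it simply swaps the two barrier directions based on which buffer the previous dispatch wrote.

```cpp
// Hypothetical variant of addComputeToComputeBarriers that alternates the barrier
// directions per pass; "inputWasWritten" flips on every iteration of the 64-pass loop.
void addComputeToComputeBarriers(VkCommandBuffer commandBuffer, bool inputWasWritten) {
    VkBufferMemoryBarrier barriers[2]{};

    // The buffer that was just written goes from write to read.
    barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barriers[0].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barriers[0].dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barriers[0].buffer = inputWasWritten ? storageBuffers.input.buffer : storageBuffers.output.buffer;
    barriers[0].offset = 0;
    barriers[0].size = VK_WHOLE_SIZE;

    // The buffer that was just read goes from read to write.
    barriers[1] = barriers[0];
    barriers[1].srcAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barriers[1].dstAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barriers[1].buffer = inputWasWritten ? storageBuffers.output.buffer : storageBuffers.input.buffer;

    vkCmdPipelineBarrier(commandBuffer,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        0, 0, nullptr, 2, barriers, 0, nullptr);
}
```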