This is a snippet of a CUDA kernel launch from the DispatchPlan module.
...
if (app->configuration.num_streams >= 1) {
    result = cuLaunchKernel(axis->VkFFTKernel, ..., app->configuration.stream[app->configuration.streamID], args, 0);
}
else {
    result = cuLaunchKernel(axis->VkFFTKernel, ..., 0, args, 0);
}
// result check
if (app->configuration.num_streams > 1) {
    app->configuration.streamID = app->configuration.streamCounter % app->configuration.num_streams;
    if (app->configuration.streamCounter == 0) {
        cudaError_t res2 = cudaEventRecord(app->configuration.stream_event[app->configuration.streamID], app->configuration.stream[app->configuration.streamID]);
        if (res2 != cudaSuccess) return VKFFT_ERROR_FAILED_TO_EVENT_RECORD;
    }
    app->configuration.streamCounter++;
}
...
I do not understand several things about this code:
Why is each kernel launched into a different stream? I see that in the RunApp module you call VkFFTSync after each kernel launch. I think this is not necessary unless you want to execute the work in parallel.
Is it correct that only the event at index 0 is ever recorded into a stream, because of the streamCounter == 0 check? It looks more like a mistake.
Next, here is a snippet from the VkFFTSync function.
...
if (app->configuration.num_streams > 1) {
    cudaError_t res = cudaSuccess;
    for (pfUINT s = 0; s < app->configuration.num_streams; s++) {
        res = cudaEventSynchronize(app->configuration.stream_event[s]);
        if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
    }
    app->configuration.streamCounter = 0;
}
...
This is the synchronization of multiple CUDA streams. If I am not mistaken, it synchronizes on events that were never recorded into a stream. It also makes the application synchronous; I guess that cudaStreamWaitEvent would be more suitable in this case.
But overall I feel that the whole design of using multiple streams is wrong. What I think would be right is:
1. When the plan is created, as many events as there are streams should be created.
2. When the VkFFTAppend function is called, this should happen:
   - Events should be recorded into each stream except the first via cudaEventRecord.
   - The first stream should wait for all of the work in the other streams to finish, by calling cudaStreamWaitEvent on each event except the first.
   - All of the work should be launched into the first stream.
   - When everything is done, the first event should be recorded into the first stream via cudaEventRecord.
   - All of the streams except the first should call cudaStreamWaitEvent on the first event.
3. The user launches more work into the streams.
This approach should work fine and would even allow the use of CUDA Graphs via stream capture. The same applies to HIP.
Thanks!
David
Multiple streams were a test to mimic the Vulkan behavior of shader dispatches to the pipeline, where, unless synchronized, they launch without waiting for the previous shader to complete. This is unlike the kernel model of CUDA, where kernels wait for previous kernels. Its usefulness turned out to be very limited: it only matters when a kernel needs multiple dispatches because the grid dimensions exceed the device limits (65k for y and z). However, these workloads are typically big and fully utilize the GPU by themselves with low CPU overhead, so using multiple streams was not useful at all. I think you are correct that the synchronization is currently messed up in this version; I will need to check your changes in detail when I have more time.