-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vulkan: Write descriptor sets on the render thread #18328
Conversation
be4631a
to
78c3c3b
Compare
Common/GPU/Vulkan/VulkanDescSet.cpp
Outdated
@@ -68,14 +69,17 @@ VkDescriptorSet VulkanDescSetPool::Allocate(int n, const VkDescriptorSetLayout * | |||
result = vkAllocateDescriptorSets(vulkan_->GetDevice(), &descAlloc, &desc); | |||
_assert_msg_(result == VK_SUCCESS, "Ran out of descriptor space (frag?) and failed to allocate after recreating a descriptor pool. res=%d", (int)result); | |||
} | |||
usage_++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another one just below. Intentional to have two? Seems probably better in the other place.
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this was a merge between two branches where I did the same fix in both..
@@ -1336,7 +1338,8 @@ void VulkanQueueRunner::PerformRenderPass(const VKRStep &step, VkCommandBuffer c | |||
|
|||
case VKRRenderCommand::DRAW_INDEXED: | |||
if (pipelineOK) { | |||
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout, 0, 1, &c.drawIndexed.ds, c.drawIndexed.numUboOffsets, c.drawIndexed.uboOffsets); | |||
VkDescriptorSet set = (*descSets)[c.drawIndexed.descSetIndex].set; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe deserves a _dbg_assert_(set != VK_NULL_HANDLE);
here as well, just for symmetry and to ensure no false sense of security when skimming the code?
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.
@@ -1462,6 +1477,11 @@ void VulkanRenderManager::Run(VKRRenderThreadTask &task) { | |||
} | |||
frameData.SubmitPending(vulkan_, FrameSubmitType::Pending, frameDataShared_); | |||
|
|||
// Flush descriptors. | |||
double descStart = time_now_d(); | |||
FlushDescriptors(task.frame); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, we could create these on a separate thread even, and wait for them, somewhat like shaders (of course, they'd need to be continuous, so this would only help if it could complete between the time drawing was enqueued and when the frame was picked up by the renderer.)
Not sure if the thread waits would definitely be worth it, but if this was a decent win, most devices nowadays have more cores than we'll ever be able to effectively use (I've been meaning to move atrac to a thread, for this reason - even if it only helps a tiny bit, I think it should help... does actually show up in profiles these days.)
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I thought about that, but it doesn't really seem necessary - there's usually plenty of time left to do these on the render thread without slowing anything down, even on really slow devices. I suppose it slightly increases latency to do them in a big block before submitting draws.. But it might be worth experimenting with - the more parallelism, the more this crowd of CPUs can clock down...
@@ -1625,10 +1645,153 @@ VKRPipelineLayout *VulkanRenderManager::CreatePipelineLayout(BindingType *bindin | |||
|
|||
vulkan_->SetDebugName(layout->descriptorSetLayout, VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT, tag); | |||
vulkan_->SetDebugName(layout->pipelineLayout, VK_OBJECT_TYPE_PIPELINE_LAYOUT, tag); | |||
|
|||
for (int i = 0; i < VulkanContext::MAX_INFLIGHT_FRAMES; i++) { | |||
layout->frameData[i].pool.Create(vulkan_, bindingTypes, (uint32_t)bindingTypesCount, 512); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 512 enough? Just asking because we've had to raise these numbers before. I suppose it's for each pipeline... not really sure what the cost is here of increasing it.
Well, maybe it was 512 before...
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not for each pipeline, it's for each pipeline layout, and we only use two - one for UI and one for in-game.
1024 might be a more reasonable starting point, a lot of games end up there, while much fewer ones go all the way to 2048 and beyond (basically GTA).
Common/UI/Context.cpp
Outdated
@@ -83,8 +83,6 @@ void UIContext::BeginPipeline(Draw::Pipeline *pipeline, Draw::SamplerState *samp | |||
// Also clear out any other textures bound. | |||
Draw::SamplerState *samplers[3]{ samplerState }; | |||
draw_->BindSamplerStates(0, 3, samplers); | |||
Draw::Texture *textures[2]{}; | |||
draw_->BindTextures(1, 2, textures); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, what was the reason for removing these? I feel like we've hit errors and problems when bad things are left bound, so I think this was intended to wipe it. Note that this is clearing slots 1 and 2, and RebindTexture() is setting slot 0.
Maybe it's not needed on every BeginPipeline or something, but I'm assuming it was added to fix some error...
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I had some wacky bugs initially by using the wrong offset in the descriptor writer, this was meant to be a temporary change, oops..
|
||
// This is where we look up to see if we already have an identical descriptor previously in the array. | ||
// We could do a simple custom hash map here that doesn't handle collisions, since false positives aren't too bad. | ||
// Instead, for now we just check history one item backwards. Good enough, it seems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd guess it being a match to the one previous happens a ton. The next most likely would be the one before the most recent string of them (which I think we'd need a local var to track.) But that only happens in some games, where they'll flip flop between textures. But not sure if that happens enough to be worth it...
-[Unknown]
} | ||
} | ||
|
||
// TODO: Allocate in batches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How bad is it to overallocate? It'd probably be relatively easy to allocate just in chunks of say 8 or something, exhaust and refill. Or could just estimate a fraction of descSets.size()
at the start, and fall back to one-at-a-time if it's somehow over the limit.
Wonder if there's benefit or downside to putting more distance between the allocation and writing (i.e. do we want all writes for the small batches to happen right after the alloc, or better to allocate everything, then do writes in batched chunks...) Well, just pondering, I'm sure the answer is virtually unknowable with all the drivers out there...
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, small batch allocations might be the way to go just to reduce function call overhead to basically nothing.
Another variant advocated by ARM for cases like this where we start over every frame, is to simply don't reset and allocate, just stomp over and reuse your already allocated descriptor sets.
Many things to try, but now that this is no longer on the main thread, it just doesn't seem to matter hugely and as you say it's unknowable - so I think just doing small block allocations here is the safest way. I'm thinking 16 - or maybe 8 is enough indeed. I'll do that in a followup though.
04b8c50
to
9fdc7e2
Compare
Very measurable speed boost even on PC, but even more so on for example PowerVR - on one test device, it really smoothed Tekken 6 out (well, except the shader compiles...)
Not super optimized so far (could do batching) and does the simplest possible de-duplication of, but getting work off the main thread is a win.