Vulkan: Write descriptor sets on the render thread #18328

Merged
hrydgard merged 12 commits into master from desc-on-render-thread-2 on Oct 10, 2023

Conversation

hrydgard
Owner

@hrydgard hrydgard commented Oct 9, 2023

A very measurable speed boost even on PC, but even more so on, for example, PowerVR - on one test device it really smoothed out Tekken 6 (well, except for the shader compiles...)

Not super optimized so far (could do batching) and it only does the simplest possible de-duplication, but getting work off the main thread is a win.
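Roughly, the idea in sketch form (made-up names, not the actual structures): the main thread only records what each draw binds, and the render thread allocates and writes the real VkDescriptorSets just before replaying the draws.

#include <vector>
#include <vulkan/vulkan.h>

// Hypothetical record queued by the main thread - no Vulkan calls happen there.
struct PendingDescSet {
    VkImageView imageView;
    VkSampler sampler;
    VkBuffer buffer;
    VkDescriptorSet set = VK_NULL_HANDLE;  // filled in on the render thread
};

// Render thread, once per frame, before the queued draws are replayed.
void FlushDescriptorsSketch(VkDevice device, VkDescriptorPool pool,
                            VkDescriptorSetLayout layout,
                            std::vector<PendingDescSet> &descSets) {
    for (auto &d : descSets) {
        VkDescriptorSetAllocateInfo alloc{ VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO };
        alloc.descriptorPool = pool;
        alloc.descriptorSetCount = 1;
        alloc.pSetLayouts = &layout;
        vkAllocateDescriptorSets(device, &alloc, &d.set);  // error handling omitted

        VkDescriptorImageInfo img{ d.sampler, d.imageView, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
        VkDescriptorBufferInfo buf{ d.buffer, 0, VK_WHOLE_SIZE };
        VkWriteDescriptorSet writes[2]{};
        writes[0].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writes[0].dstSet = d.set;
        writes[0].dstBinding = 0;
        writes[0].descriptorCount = 1;
        writes[0].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
        writes[0].pImageInfo = &img;
        writes[1].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writes[1].dstSet = d.set;
        writes[1].dstBinding = 1;
        writes[1].descriptorCount = 1;
        writes[1].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
        writes[1].pBufferInfo = &buf;
        vkUpdateDescriptorSets(device, 2, writes, 0, nullptr);
    }
}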

@hrydgard hrydgard added Vulkan GE emulation Backend-independent GPU issues labels Oct 9, 2023
@hrydgard hrydgard added this to the v1.17.0 milestone Oct 9, 2023
@hrydgard hrydgard force-pushed the desc-on-render-thread-2 branch 2 times, most recently from be4631a to 78c3c3b, on October 9, 2023 19:38
@hrydgard hrydgard marked this pull request as ready for review October 9, 2023 21:20
@@ -68,14 +69,17 @@ VkDescriptorSet VulkanDescSetPool::Allocate(int n, const VkDescriptorSetLayout *
result = vkAllocateDescriptorSets(vulkan_->GetDevice(), &descAlloc, &desc);
_assert_msg_(result == VK_SUCCESS, "Ran out of descriptor space (frag?) and failed to allocate after recreating a descriptor pool. res=%d", (int)result);
}
usage_++;
Collaborator

There's another one just below. Intentional to have two? Seems probably better in the other place.

-[Unknown]

Owner Author

Oh, this was a merge between two branches where I did the same fix in both..

@@ -1336,7 +1338,8 @@ void VulkanQueueRunner::PerformRenderPass(const VKRStep &step, VkCommandBuffer c

case VKRRenderCommand::DRAW_INDEXED:
if (pipelineOK) {
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout, 0, 1, &c.drawIndexed.ds, c.drawIndexed.numUboOffsets, c.drawIndexed.uboOffsets);
VkDescriptorSet set = (*descSets)[c.drawIndexed.descSetIndex].set;
Collaborator

Maybe deserves a _dbg_assert_(set != VK_NULL_HANDLE); here as well, just for symmetry and to ensure no false sense of security when skimming the code?

-[Unknown]

Owner Author

Yeah.
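Something like this, reusing the lines from the hunk above with the suggested assert added (the bind presumably switches to the locally fetched set; exact placement may differ):

case VKRRenderCommand::DRAW_INDEXED:
    if (pipelineOK) {
        VkDescriptorSet set = (*descSets)[c.drawIndexed.descSetIndex].set;
        // The suggested debug-only check: fail loudly if the set was never written.
        _dbg_assert_(set != VK_NULL_HANDLE);
        vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout, 0, 1, &set,
                                c.drawIndexed.numUboOffsets, c.drawIndexed.uboOffsets);
        // ...the indexed draw itself follows here.
    }
    break;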

@@ -1462,6 +1477,11 @@ void VulkanRenderManager::Run(VKRRenderThreadTask &task) {
}
frameData.SubmitPending(vulkan_, FrameSubmitType::Pending, frameDataShared_);

// Flush descriptors.
double descStart = time_now_d();
FlushDescriptors(task.frame);
Collaborator

In theory, we could create these on a separate thread even, and wait for them, somewhat like shaders (of course, they'd need to be continuous, so this would only help if it could complete between the time drawing was enqueued and when the frame was picked up by the renderer.)

Not sure if the thread waits would definitely be worth it, but if this was a decent win, most devices nowadays have more cores than we'll ever be able to use effectively (I've been meaning to move atrac to a thread for this reason - even if it only helps a tiny bit, I think it should help... it does actually show up in profiles these days.)

-[Unknown]

Owner Author
@hrydgard hrydgard Oct 10, 2023

Yep, I thought about that, but it doesn't really seem necessary - there's usually plenty of time left to do these on the render thread without slowing anything down, even on really slow devices. I suppose it slightly increases latency to do them in a big block before submitting draws.. But it might be worth experimenting with - the more parallelism, the more this crowd of CPUs can clock down...

@@ -1625,10 +1645,153 @@ VKRPipelineLayout *VulkanRenderManager::CreatePipelineLayout(BindingType *bindin

vulkan_->SetDebugName(layout->descriptorSetLayout, VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT, tag);
vulkan_->SetDebugName(layout->pipelineLayout, VK_OBJECT_TYPE_PIPELINE_LAYOUT, tag);

for (int i = 0; i < VulkanContext::MAX_INFLIGHT_FRAMES; i++) {
layout->frameData[i].pool.Create(vulkan_, bindingTypes, (uint32_t)bindingTypesCount, 512);
Collaborator

Is 512 enough? Just asking because we've had to raise these numbers before. I suppose it's for each pipeline... not really sure what the cost is here of increasing it.

Well, maybe it was 512 before...

-[Unknown]

Owner Author

It's not for each pipeline, it's for each pipeline layout, and we only use two - one for UI and one for in-game.

1024 might be a more reasonable starting point - a lot of games end up there, while far fewer go all the way to 2048 and beyond (basically GTA).
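In other words, the only change being discussed here would be the starting size passed to the per-frame pools in the hunk above:

for (int i = 0; i < VulkanContext::MAX_INFLIGHT_FRAMES; i++) {
    // Start each per-layout, per-inflight-frame pool at 1024 sets instead of 512.
    layout->frameData[i].pool.Create(vulkan_, bindingTypes, (uint32_t)bindingTypesCount, 1024);
}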

@@ -83,8 +83,6 @@ void UIContext::BeginPipeline(Draw::Pipeline *pipeline, Draw::SamplerState *samp
// Also clear out any other textures bound.
Draw::SamplerState *samplers[3]{ samplerState };
draw_->BindSamplerStates(0, 3, samplers);
Draw::Texture *textures[2]{};
draw_->BindTextures(1, 2, textures);
Collaborator

Hm, what was the reason for removing these? I feel like we've hit errors and problems when bad things are left bound, so I think this was intended to wipe it. Note that this is clearing slots 1 and 2, and RebindTexture() is setting slot 0.

Maybe it's not needed on every BeginPipeline or something, but I'm assuming it was added to fix some error...

-[Unknown]

Owner Author
@hrydgard hrydgard Oct 10, 2023

Oh, I had some wacky bugs initially from using the wrong offset in the descriptor writer - this was meant to be a temporary change, oops...
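So presumably the clearing comes back, i.e. the removed lines from the hunk above, with a note on which slots they cover:

// Also clear out any other textures bound.
Draw::SamplerState *samplers[3]{ samplerState };
draw_->BindSamplerStates(0, 3, samplers);
Draw::Texture *textures[2]{};         // null textures...
draw_->BindTextures(1, 2, textures);  // ...clearing slots 1 and 2; slot 0 is handled by RebindTexture()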


// This is where we look up to see if we already have an identical descriptor previously in the array.
// We could do a simple custom hash map here that doesn't handle collisions, since false positives aren't too bad.
// Instead, for now we just check history one item backwards. Good enough, it seems.
Collaborator

Yeah, I'd guess it being a match to the one previous happens a ton. The next most likely would be the one before the most recent string of them (which I think we'd need a local var to track.) But that only happens in some games, where they'll flip flop between textures. But not sure if that happens enough to be worth it...

-[Unknown]
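For illustration, a rough sketch of both candidates (made-up types and names, not the actual code): match against the immediately preceding entry, and also remember the set in use before the current run started, to catch the flip-flop case.

#include <utility>
#include <vector>
#include <vulkan/vulkan.h>

// Hypothetical key describing what a queued descriptor set binds.
struct DescSetKey {
    VkImageView imageView;
    VkSampler sampler;
    VkBuffer buffer;
    bool operator==(const DescSetKey &o) const {
        return imageView == o.imageView && sampler == o.sampler && buffer == o.buffer;
    }
};

// lastIndex: set used by the previous draw. prevRunIndex: set in use before the
// current run of identical draws started. Both start out as -1.
int DedupDescSet(std::vector<DescSetKey> &sets, int &lastIndex, int &prevRunIndex,
                 const DescSetKey &key) {
    if (lastIndex >= 0 && sets[lastIndex] == key) {
        return lastIndex;                    // one item back - the common case
    }
    if (prevRunIndex >= 0 && sets[prevRunIndex] == key) {
        std::swap(lastIndex, prevRunIndex);  // the A/B/A/B texture flip-flop case
        return lastIndex;
    }
    prevRunIndex = lastIndex;                // a genuinely new set starts a new run
    sets.push_back(key);
    lastIndex = (int)sets.size() - 1;
    return lastIndex;
}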

}
}

// TODO: Allocate in batches.
Collaborator

How bad is it to overallocate? It'd probably be relatively easy to allocate just in chunks of say 8 or something, exhaust and refill. Or could just estimate a fraction of descSets.size() at the start, and fall back to one-at-a-time if it's somehow over the limit.

Wonder if there's benefit or downside to putting more distance between the allocation and writing (i.e. do we want all writes for the small batches to happen right after the alloc, or better to allocate everything, then do writes in batched chunks...) Well, just pondering, I'm sure the answer is virtually unknowable with all the drivers out there...

-[Unknown]

Owner Author
@hrydgard hrydgard Oct 10, 2023

Yeah, small batch allocations might be the way to go just to reduce function call overhead to basically nothing.

Another variant, advocated by ARM for cases like this where we start over every frame, is to simply not reset and reallocate at all, but just stomp over and reuse the already allocated descriptor sets.

Many things to try, but now that this is no longer on the main thread, it just doesn't seem to matter hugely and as you say it's unknowable - so I think just doing small block allocations here is the safest way. I'm thinking 16 - or maybe 8 is enough indeed. I'll do that in a followup though.
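A rough sketch of what the small-block variant could look like, assuming all sets in a batch share one layout (in the real code this would be per pipeline layout, and would keep the pool-recreation fallback that VulkanDescSetPool::Allocate already has):

#include <algorithm>
#include <vector>
#include <vulkan/vulkan.h>

constexpr uint32_t kChunkSize = 8;  // or 16, as discussed

bool AllocateInChunks(VkDevice device, VkDescriptorPool pool, VkDescriptorSetLayout layout,
                      std::vector<VkDescriptorSet> &sets, size_t count) {
    VkDescriptorSetLayout layouts[kChunkSize];
    std::fill(layouts, layouts + kChunkSize, layout);

    sets.resize(count);
    size_t done = 0;
    while (done < count) {
        uint32_t n = (uint32_t)std::min<size_t>(kChunkSize, count - done);
        VkDescriptorSetAllocateInfo alloc{ VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO };
        alloc.descriptorPool = pool;
        alloc.descriptorSetCount = n;  // one vkAllocateDescriptorSets call per chunk
        alloc.pSetLayouts = layouts;
        if (vkAllocateDescriptorSets(device, &alloc, sets.data() + done) != VK_SUCCESS) {
            return false;              // real code would grow/recreate the pool and retry
        }
        done += n;
    }
    return true;
}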

@hrydgard hrydgard force-pushed the desc-on-render-thread-2 branch from 04b8c50 to 9fdc7e2 on October 10, 2023 07:05
@hrydgard hrydgard merged commit 9767f43 into master Oct 10, 2023
@hrydgard hrydgard deleted the desc-on-render-thread-2 branch October 10, 2023 07:49