Panic in GPU culler for bind group too large. #541

Closed
John-Nagle opened this issue Dec 18, 2023 · 11 comments

@John-Nagle (Contributor)

Internal panic in GPU culler when bind group is too large.

05:36:12 [ERROR] =========> Panic wgpu error: Validation Error

Caused by:
    In Device::create_bind_group
      note: label = `GpuCuller rend3_routine::pbr::material::PbrMaterial BG`
    Buffer binding 4 range 2147483656 exceeds `max_*_buffer_binding_size` limit 2147483648

 at file /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs, line 3111 in thread main.
Backtrace:
 libcommon::common::commonutils::catch_panic::{{closure}}
             at /home/john/projects/sl/SL-test-viewer/libcommon/src/common/commonutils.rs:215:25
 wgpu::backend::direct::default_error_handler
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:3111:5
 wgpu::backend::direct::ErrorSinkRaw::handle_error
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:3097:17
 wgpu::backend::direct::Context::handle_error
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:333:9
 <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_bind_group
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:1107:13
 <T as wgpu::context::DynContext>::device_create_bind_group
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/context.rs:2308:13
 wgpu::Device::create_bind_group
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/lib.rs:2507:26
 rend3_routine::culling::culler::GpuCuller::cull
             at /home/john/.cargo/git/checkouts/rend3-e03f89403de3386a/9065f1e/rend3-routine/src/culling/culler.rs:613:26
 rend3_routine::culling::culler::GpuCuller::add_culling_to_graph::{{closure}}
             at /home/john/.cargo/git/checkouts/rend3-e03f89403de3386a/9065f1e/rend3-routine/src/culling/culler.rs:757:30
 rend3::graph::graph::RenderGraph::execute
             at /home/john/.cargo/git/checkouts/rend3-e03f89403de3386a/9065f1e/rend3/src/graph/graph.rs:501:17

Rend3 rev = "9065f1e".

@John-Nagle (Contributor, Author)

Running out of GPU memory in mesh creation is now being properly reported to the application level, and the program continues to run. So that worked. Looks like there are other places where that limit can be hit.

@cwfitzgerald (Member)

Interesting to note this is only 8 bytes over the limit, I wonder if this is an off-by-a-smidge error.

@John-Nagle (Contributor, Author)

I'm operating very close to the limit right now. I create meshes until I hit the bind group limit and get the mesh error. Then I put the failed request on hold. New requests continue to hit the limit, and they, too, get put on hold. There's a background task which manages levels of detail; it will take steps to reduce the memory pressure and redo the failed items, but that's only partly written and not working yet. Once it's all working, it will only hit the limit occasionally, then back off.

So if something in the GPU culler needs some bind group space during rendering, it's likely to hit the limit.

There are two ways to go at this:

1) Bang into the limit, get an error return, and recover. This requires that all components be able to operate right up to the limit. That's the current implementation.
2) Provide info on how much of the resource is left, so the application can back off before hitting the limit.

The current choice is 1). I've figured out how to work with that, and it's going well.

With 2), it's necessary to have reliable info about how much of the resource is left. This is apparently difficult. Fragmentation may be an issue. (Does bind group space get fragmented?) It's extremely difficult to get memory info out of wgpu and the levels below it; for Vulkan it's listed as a proposed enhancement. So, as I understand it, we're stuck with 1).
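A minimal sketch of approach 1), using a hypothetical byte budget rather than the real rend3/wgpu accounting (`Budget` and `try_alloc` are illustrative names, not anything from either library): attempt the allocation, and on failure park the request until the LOD system has reduced pressure.

```rust
// Illustrative only: `Budget` and `try_alloc` are hypothetical, not
// rend3 or wgpu API. The point is the control flow of approach 1):
// bang into the limit, get an error back, and put the request on hold.
struct Budget {
    used: u64,
    limit: u64,
}

impl Budget {
    fn try_alloc(&mut self, bytes: u64) -> Result<(), u64> {
        if self.used + bytes > self.limit {
            Err(bytes) // caller parks the request instead of panicking
        } else {
            self.used += bytes;
            Ok(())
        }
    }
}

fn main() {
    let mut budget = Budget { used: 0, limit: 100 };
    let mut on_hold = Vec::new();

    for request in [60, 30, 40] {
        if let Err(bytes) = budget.try_alloc(request) {
            // Held until the LOD manager cuts quality and frees space.
            on_hold.push(bytes);
        }
    }
    println!("used={} on_hold={:?}", budget.used, on_hold);
}
```

The key property is that a failed request is an ordinary `Err`, not a panic, so the caller can retry it after the background LOD task backs off.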

@John-Nagle (Contributor, Author)

Somewhat related: At the 2147483648 limit, my own count of vertices is 37098544.
That's about 57.9 bytes per vertex. Reasonable?
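The figure can be checked directly, dividing the 2 GiB `max_*_buffer_binding_size` value from the panic message by the vertex count above:

```rust
// Divide the 2 GiB binding limit from the panic message by the
// vertex count reported above.
fn bytes_per_vertex(limit_bytes: u64, vertex_count: u64) -> f64 {
    limit_bytes as f64 / vertex_count as f64
}

fn main() {
    let per_vertex = bytes_per_vertex(2_147_483_648, 37_098_544);
    println!("{per_vertex:.1} bytes per vertex"); // ~57.9
}
```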

@marstaik

I'm getting this too. It happens randomly, and I don't believe I'm ever near the bind group limit.

@cwfitzgerald (Member) commented Dec 30, 2023

So this problem is caused by the result index buffer getting too large - if the total indices in the scene are greater than 2^27, you'll hit this problem. This is one pretty major disadvantage of the culling system as it stands, and I'm currently scheming on how to remove this limit. I can raise it to 2^28 pretty easily as there's currently an off-by-8-bytes situation. But I'm generally concerned about the limitations the culling system has, and the minimal performance benefits, so I may remove it in favor of other culling techniques.
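The off-by-8 is visible directly in both panic messages in this thread, and the 2^27 index figure is consistent with the 2 GiB limit if each entry in the result index buffer occupies 16 bytes (an assumption for illustration, not taken from the rend3 source):

```rust
// How far each failing binding range is over its device limit.
fn overage(requested_range: u64, limit: u64) -> u64 {
    requested_range - limit
}

fn main() {
    // Both panics in this thread exceed their device limit by 8 bytes.
    println!("{}", overage(2_147_483_656, 1 << 31)); // first panic: 8
    println!("{}", overage(134_217_736, 1 << 27));   // second panic: 8
    // If each result entry took 16 bytes (an assumption, not taken from
    // the rend3 source), 2^27 indices would land exactly on 2 GiB.
    println!("{}", (1u64 << 31) / 16 == 1 << 27); // true
}
```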

@John-Nagle (Contributor, Author)

Sounds good. I've been able to rework things such that hitting the limit is now recoverable. It now tells the level of detail system to cut back on quality. But a higher ceiling would be nice.

@John-Nagle (Contributor, Author)

I just built a version of Sharpview where this is a hard error that fails at startup every time, even on simple scenes. In addition, the rendered images have random triangles all over the place. These have been rare intermittent problems for months, but now I have a solid repro.

04:14:15 [ERROR] =========> Panic wgpu error: Validation Error

Caused by:
    In Device::create_bind_group
      note: label = `GpuCuller rend3_routine::pbr::material::PbrMaterial BG`
    Buffer binding 4 range 134217736 exceeds `max_*_buffer_binding_size` limit 134217728

 at file /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs, line 3009 in thread main.
Backtrace:
 libcommon::common::commonutils::catch_panic::{{closure}}
             at /home/john/projects/sl/SL-test-viewer/libcommon/src/common/commonutils.rs:215:25
 wgpu::backend::wgpu_core::default_error_handler
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:3009:5
 wgpu::backend::wgpu_core::ErrorSinkRaw::handle_error
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:2995:17
 wgpu::backend::wgpu_core::ContextWgpuCore::handle_error
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:262:9
 <wgpu::backend::wgpu_core::ContextWgpuCore as wgpu::context::Context>::device_create_bind_group
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:1043:13
 <T as wgpu::context::DynContext>::device_create_bind_group
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/context.rs:2236:13
 wgpu::Device::create_bind_group
             at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/lib.rs:2430:26
 rend3_routine::culling::culler::GpuCuller::cull
             at /home/john/.cargo/git/checkouts/rend3-issue570-7a55d7cece9b9b17/bafdc3b/rend3-routine/src/culling/culler.rs:614:26
 rend3_routine::culling::culler::GpuCuller::add_culling_to_graph::{{closure}}
             at /home/john/.cargo/git/checkouts/rend3-issue570-7a55d7cece9b9b17/bafdc3b/rend3-routine/src/culling/culler.rs:765:30
 rend3::graph::graph::RenderGraph::execute
             at /home/john/.cargo/git/checkouts/rend3-issue570-7a55d7cece9b9b17/bafdc3b/rend3/src/graph/graph.rs:503:17

This started failing after I changed some visibility of modules in mod.rs files. Didn't even change any code. So it may depend on memory layout. My own code is 100% safe Rust, so short of a compiler error, that shouldn't matter.

Saved the bad executable, did cargo clean, and rebuilt. Rebuilt version still fails in the same way. So it wasn't a transient bad compile.

This is a relatively simple test scene and is nowhere near the bind limit. I've tried logging into different places in Second Life and OSGrid, and all fail the same way.

@John-Nagle (Contributor, Author)

[Screenshot: cullersrash1]
This is what I'm seeing on screen. Some legit content, some flickering triangles.

@John-Nagle (Contributor, Author)

Fails in both debug and release mode in the same way. Just slower in debug.

@cwfitzgerald (Member)

Closed by #593
