
Applications using wgpu hang forever on bleeding edge Linux with Nvidia drivers 545.29.06 on GNOME / Wayland #4775

Closed
udoprog opened this issue Nov 25, 2023 · 15 comments
Labels: api: vulkan (Issues with Vulkan) · area: wsi (Issues with swapchain management or windowing) · external: driver-bug (A driver is causing the bug, though we may still want to work around it)

Comments

udoprog (Contributor) commented Nov 25, 2023

Repro steps
Running anything which tries to use wgpu Vulkan, like:

cd examples && cargo run cube

The window starts and renders at least one frame, but then becomes completely non-interactive (windows can't be interacted with or moved), and GNOME shows a "hanging" prompt:

(screenshot: GNOME hang prompt)

Note that I think this might legitimately be a platform issue; however:

  • I am unable to reproduce it with either vkcube (X11) or vkcube-wayland, both of which report a GPU and run fine (see below).
  • winit examples also run without issues.
> sudo dnf install vulkan-tools
> vkcube-wayland
Selected GPU 0: NVIDIA GeForce RTX 2080 Ti, type: DiscreteGpu
(video attachment: Screencast.from.2023-11-25.18-23-21.webm)

So wgpu is currently the lowest level of abstraction I've chased down.

Platform

Log output from running the example:

wgpu_core::instance] Adapter Vulkan AdapterInfo { name: "NVIDIA GeForce RTX 2080 Ti", vendor: 4318, device: 7687, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "545.29.06", backend: Vulkan }

uname -r:

6.7.0-0.rc2.20231122gitc2d5304e6c64.23.fc40.x86_64
udoprog (Contributor, Author) commented Nov 25, 2023

This is probably a duplicate of #4689, but I'm just gonna add what I've found so far here:

This is where we hang forever:

unsafe { sc.device.raw.wait_for_fences(fences, true, !0) }

Out of curiosity, I added some instrumentation:

let fences = &[sc.fence];

unsafe {
    let status = sc.device.raw.get_fence_status(sc.fence)
        .map_err(crate::DeviceError::from)?;
    println!("wait: {}", status);
    sc.device.raw.wait_for_fences(fences, true, !0)
        .map_err(crate::DeviceError::from)?;
    sc.device.raw.reset_fences(fences).map_err(crate::DeviceError::from)?
}

It seems to hang (although checking the status is racy) when the fence is not already signaled:

wait: true
... repeats a few hundred times
wait: true
wait: false

Note that the vulkan-tools cube demo uses a semaphore for synchronization, so it seems like fences specifically are buggy, and it's very likely a platform issue.

https://github.com/KhronosGroup/Vulkan-Tools/blob/62c4f8f7c546662aa5d43ca185e7d478d1224fb1/cube/cube.c#L1080
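The hang is in the unbounded host-side fence wait above (`!0` is `u64::MAX` nanoseconds, i.e. wait forever). As a rough illustration of why a bounded wait behaves differently, here is a toy model using only std::sync primitives. This is not wgpu's actual code and not a real Vulkan fence; it just sketches the check/wait-with-timeout logic that returns an error instead of blocking the process forever when a fence never signals:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Toy "fence": a signaled flag guarded by a mutex, paired with a condvar.
type Fence = Arc<(Mutex<bool>, Condvar)>;

// Bounded wait: Ok(()) if the fence signals within `timeout`, Err(()) on
// timeout, so the caller can surface a device error instead of blocking
// the event loop forever the way `wait_for_fences(fences, true, !0)` does.
fn wait_for_fence(fence: &Fence, timeout: Duration) -> Result<(), ()> {
    let (lock, cvar) = &**fence;
    let mut signaled = lock.lock().unwrap();
    while !*signaled {
        let (guard, res) = cvar.wait_timeout(signaled, timeout).unwrap();
        signaled = guard;
        if res.timed_out() && !*signaled {
            return Err(());
        }
    }
    Ok(())
}

fn main() {
    // Nothing ever signals this fence, mimicking the driver bug: the
    // bounded wait returns instead of hanging the process.
    let fence: Fence = Arc::new((Mutex::new(false), Condvar::new()));
    assert!(wait_for_fence(&fence, Duration::from_millis(50)).is_err());

    // A pre-signaled fence returns immediately.
    let fence2: Fence = Arc::new((Mutex::new(true), Condvar::new()));
    assert!(wait_for_fence(&fence2, Duration::from_millis(50)).is_ok());
    println!("ok");
}
```

Real Vulkan code would instead pass a finite timeout to `wait_for_fences` and handle `VK_TIMEOUT`, but the control flow is the same shape.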

udoprog (Contributor, Author) commented Nov 25, 2023

This article also suggests that timeline semaphores are recommended over fences for host synchronization, so it might still be a worthwhile change in wgpu:

https://www.khronos.org/blog/vulkan-timeline-semaphores
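For illustration, a timeline semaphore is essentially a monotonically increasing 64-bit counter: the device bumps it as work completes, and the host can wait for it to reach a target value (via vkWaitSemaphores). A toy model of those semantics using std threads; this is not real Vulkan code, and the struct and method names are made up:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Toy model of a Vulkan timeline semaphore: a monotonically increasing
// 64-bit counter plus a condvar for host-side waiters.
struct Timeline {
    value: Mutex<u64>,
    cvar: Condvar,
}

impl Timeline {
    fn new() -> Self {
        Timeline { value: Mutex::new(0), cvar: Condvar::new() }
    }

    // Host-side wait until the counter reaches `target`
    // (the rough analogue of vkWaitSemaphores).
    fn wait(&self, target: u64) {
        let mut v = self.value.lock().unwrap();
        while *v < target {
            v = self.cvar.wait(v).unwrap();
        }
    }

    // "Device-side" signal: the GPU bumping the counter as submissions
    // complete; the value may only increase.
    fn signal(&self, new_value: u64) {
        let mut v = self.value.lock().unwrap();
        *v = (*v).max(new_value);
        self.cvar.notify_all();
    }
}

fn main() {
    let tl = Arc::new(Timeline::new());
    let gpu = Arc::clone(&tl);
    let handle = thread::spawn(move || {
        for n in 1..=3 {
            gpu.signal(n); // one signal per finished "submission"
        }
    });
    tl.wait(3); // returns once the counter reaches 3
    handle.join().unwrap();
    assert_eq!(*tl.value.lock().unwrap(), 3);
    println!("timeline reached 3");
}
```

One counter can order many submissions, which is part of why the Khronos article recommends timelines over per-submission fences for host synchronization.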

udoprog (Contributor, Author) commented Nov 25, 2023

Using a semaphore works for me, although the patch I wrote isn't pretty. Preferably the semaphore should be waited on when submitting a command buffer.

cwfitzgerald (Member) commented Nov 26, 2023

Thanks for the investigation into this!

> This article also suggests that timeline semaphores are recommended over fences for host synchronization, so it might still be a worthwhile change in wgpu:

Fences should still work. Either way, you can't use timeline semaphores for swapchain operations, only binary semaphores. Does vkcube break if converted to wait on a fence?

This sounds like a driver bug and needs to be reported to NVIDIA.

udoprog (Contributor, Author) commented Nov 26, 2023

So this is the patch I'm using on vulkan-tools.

# Seems to be easier to install X11 dependencies than disable the build
> sudo dnf install libxcb-devel libX11-devel libXrandr-devel wayland-devel
cmake -S . -B build-release -D UPDATE_DEPS=ON -D BUILD_WERROR=ON -D BUILD_TESTS=ON -D CMAKE_BUILD_TYPE=Release
cmake --build build-release --config Release
./build-release/cube/vkcube-wayland

Note that this happens for both Release and Debug builds; I was using Release above in the hope that I'd observe an unsignaled fence.

I'm currently not able to reproduce it with vkcube-wayland, but I'm also not able to observe an unsignaled fence:

> build-release/cube/vkcube-wayland
... lots of lines
before: 1
fence: 0

It's hard to say why. If someone has some other code they'd like me to run, I'd be happy to.

udoprog (Contributor, Author) commented Nov 26, 2023

> Fences should still work. Either way, you can't use timeline semaphores for swapchain operations, only binary semaphores. Does vkcube break if converted to wait on a fence?
>
> This sounds like a driver bug and needs to be reported to NVIDIA.

Sounds good, any idea where?

In the meantime, since I'm not super familiar with wgpu: is there something that necessitates using a fence? From my brief skim of the implementation it's not clear whether a fence is necessary, versus using a semaphore and waiting on it as we submit a command buffer.

teoxoy added the external: driver-bug, api: vulkan, and area: wsi labels on Dec 11, 2023
ids1024 (Contributor) commented Jan 3, 2024

I am not familiar enough with Vulkan to know what the best thing to do here is, but the Nvidia driver does seem to be violating this rough guarantee of the Vulkan spec:

> While we guarantee that vkWaitForFences must return in finite time, no guarantees are made that it returns immediately upon device loss. However, the client can reasonably expect that the delay will be on the order of seconds and that calling vkWaitForFences will not result in a permanently (or seemingly permanently) dead process.

So unless wgpu is violating the valid usages of the API (and thus triggering undefined behavior), calling vkWaitForFences shouldn't produce an indefinite hang like this.

So it seems fair to say this is at least partly a driver bug.

udoprog (Contributor, Author) commented Jan 3, 2024

@ids1024 The scenario cited there is about what should happen during device loss, which is different from what happens here.

I don't know this for sure, but my current understanding is that the spec doesn't guarantee when the fence will be signaled, because the presentation engine may opt to hold onto the swapchain image for as long as it wants, which here seems to be until a new frame is submitted or presented. Android apparently does something like that so it can use the swapchain image for other things between render calls.

At least that is my conclusion from a careful read of the spec regarding the relevant functions. That doesn't mean Nvidia might not still be interested in fixing it. That being said, the vast majority of applications do what I've proposed in #4967, so we probably want to do the same to avoid problems.
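For context, the conventional WSI pattern (and roughly the direction of #4967, as I understand it) keeps the CPU out of the acquire path entirely: the presentation engine signals a binary semaphore that the next queue submission waits on, so no host-side fence wait sits between acquire and submit. A rough pseudocode sketch, with illustrative names only:

```
// Before (hangs on the affected driver):
image = acquire_next_image(swapchain)
wait_for_fences([present_fence], timeout = forever)   // CPU blocks here

// After (semaphore-based, CPU never blocks on acquire):
image = acquire_next_image(swapchain, signal: acquire_semaphore)
queue_submit(commands,
             wait:   [acquire_semaphore],   // GPU waits instead of CPU
             signal: [render_semaphore])
queue_present(image, wait: [render_semaphore])
```

This is only a sketch of the general pattern, not the actual wgpu change.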

ids1024 (Contributor) commented Jan 3, 2024

Ah, I guess the line above that says the "return in finite time" guarantee is specifically about device loss, so there's no stated guarantee that the call won't block indefinitely in other circumstances.

@ryzendew commented
I reported this issue directly to an NVIDIA Linux driver dev.

@zocker-160 commented
for the record:
Nvidia bug report by @ryzendew https://forums.developer.nvidia.com/t/wgpu-driver-bug/280420

krakow10 commented Mar 25, 2024

Hi all! This is supposedly fixed in Nvidia driver 550.67, as can be seen in the driver release notes, and I can confirm that it works with my personal project using wgpu.

kaimast commented Mar 25, 2024

I just checked (Gnome 46, Wayland, and Nvidia 550.67) and the problem is gone for me!

Wumpf (Member) commented Mar 25, 2024

Sounds great! Closing this as fixed, then, until we have new reason to believe otherwise.

10 participants