
[WIP] Background thread for automatic device polling #1891

Closed · wants to merge 3 commits into from

Conversation

@xq-tec commented Aug 31, 2021

Connections
See #1871, which discusses the general problem and solution.

Description
This PR spawns an optional background thread that polls the device when a buffer is mapped. This enables more idiomatic code:

let buffer_slice = buffer.slice(..);
if let Ok(()) = buffer_slice.map_async(...).await {
    // ...
}

instead of

let buffer_slice = buffer.slice(..);
let fut = buffer_slice.map_async(...);
device.poll(wgpu::Maintain::Wait);
if let Ok(()) = fut.await {
    // ...
}

The implementation adds an optional closure to wgpu_core::device::Device that is invoked whenever a buffer mapping is set up. When a wgpu::Device is created, this closure can be set to a function that triggers a device poll on a background thread.
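
As a rough sketch (all names below are illustrative, not the PR's actual identifiers), the hook could look like this:

use std::sync::Arc;

pub struct Device {
    // ... existing wgpu-core device fields ...
    /// Optional hook invoked whenever a buffer mapping is requested, so an
    /// external integration (e.g. a background thread) can schedule a poll.
    /// `None` disables the mechanism entirely.
    map_callback: Option<Arc<dyn Fn() + Send + Sync>>,
}

impl Device {
    /// Called from the buffer-mapping setup path.
    fn notify_map_requested(&self) {
        if let Some(cb) = &self.map_callback {
            cb(); // e.g. wake the polling thread
        }
    }
}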

@xq-tec (Author) commented Sep 2, 2021

I've fixed the compile error in player, and the problems with the lints (I hope).

@pythonesque (Contributor) commented Sep 3, 2021

FWIW, as part of my upcoming changes, we'll be able to run maintain in multiple threads at once for the same queue as well as run it independently per-queue, which I think would conflict somewhat with your change.

IMO something like this is not really a substitute for a well-integrated runtime; wgpu should ideally provide some sort of runtime integration rather than just spawning a background thread like this. (Plus, in many applications, spawning background threads can be quite detrimental to performance, especially if they make heavy use of thread locals, so I'm worried about integrating this too deeply into wgpu.) For an example of what I mean, look at how flexible the rayon/crossbeam integrations are, where the user has total control over the thread pool, the workers, etc.

@xq-tec (Author) commented Sep 3, 2021

> FWIW, as part of my upcoming changes, we'll be able to run maintain in multiple threads at once for the same queue as well as run it independently per-queue, which I think would conflict somewhat with your change.

Can you link your WIP? Then I could have a look.

> IMO something like this is not really a substitute for a well-integrated runtime; wgpu should ideally provide some sort of runtime integration

How would this work? The fundamental problem is that there's no way to have the GPU driver call a function when some operation completes, so the application itself needs to block on or poll a fence and then resolve the futures. The best and most efficient place to do this is a background thread.

> rather than just spawning a background thread like this. (Plus, in many applications, spawning background threads can be quite detrimental to performance, especially if they make heavy use of thread locals, so I'm worried about integrating this too deeply into wgpu.)

Well, first, wgpu and its dependencies already spawn a lot of background threads before doing anything useful (~15 for the GL backend and ~25 for the Vulkan backend, on my system), so one more thread can't possibly hurt much.

Second, the background thread in this PR will probably never use more than its initial page of stack memory, and it is almost always idle, except for two points in time: when it is woken up and starts blocking on the device, and when the device finishes its work and the thread wakes up the blocked futures. The resources consumed by this thread are absolutely negligible compared to anything a non-trivial wgpu application will use.
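
For illustration, the thread's lifecycle could be sketched like this (the condvar-based wakeup is an assumption for the sketch; the PR's actual synchronization may differ):

use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Spawns the polling thread; the returned pair is signalled by the
/// map_async() hook to wake the thread up.
fn spawn_poll_thread(device: Arc<wgpu::Device>) -> Arc<(Mutex<bool>, Condvar)> {
    let signal = Arc::new((Mutex::new(false), Condvar::new()));
    let thread_signal = Arc::clone(&signal);
    thread::spawn(move || loop {
        let (lock, cvar) = &*thread_signal;
        // Sleep, consuming no CPU, until a mapping request signals us.
        let mut pending = lock.lock().unwrap();
        while !*pending {
            pending = cvar.wait(pending).unwrap();
        }
        *pending = false;
        drop(pending);
        // Block until the device is idle; this resolves the pending
        // mapping futures.
        device.poll(wgpu::Maintain::Wait);
    });
    signal
}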

Third, it's not deeply integrated into wgpu at all: the creation of the background thread is completely contained within the wgpu crate. The wgpu-core crate only contains the callback closure of the form Option<Arc<dyn Fn() + Send + Sync>>, which is very flexible and lends itself well to other ways of polling the device.

> For an example of what I mean, look at how flexible the rayon/crossbeam integrations are, where the user has total control over the thread pool, the workers, etc.

Applications that want and/or need total control over everything won't use wgpu anyway; they'll use Vulkan or DirectX directly. And even if they do use wgpu, it's very easy to opt out of creating the background thread.

@xq-tec (Author) commented Sep 9, 2021

So, any updates on this? Yes/no/maybe?

@kvark (Member) commented Sep 9, 2021

Sorry about the delay! I was hoping that @grovesNL can make a call on this.

@grovesNL (Collaborator) left a comment

This seems like a great start for a basic runtime, and it allows futures to work in roughly the same way across web and native whenever auto-poll is enabled. I think it would be good to proceed with this.

As @xq-tec mentioned, it's easy to opt out of this and manually control polling in the same way it currently works. We could also add direct integration with runtimes later on if we'd like (e.g. reusing an existing thread pool provided by another crate), or more detailed thread control.

@kvark (Member) left a comment

A few high-level questions that I find critical:

  1. why is this limited to buffer mapping? There are other async operations, like on_submitted_work_done.
  2. does this have to be integrated into wgpu-core at all? Can the solution live exclusively in wgpu-rs instead?
  3. in line with @pythonesque's reasoning, would this API be compatible with other runtimes, like winit's event loop?

@xq-tec (Author) commented Sep 10, 2021

Thanks, @kvark, @grovesNL!

> why is this limited to buffer mapping? There are other async operations, like on_submitted_work_done.

Because I didn't think of it, to be honest. It should be easy to support Queue::on_submitted_work_done(); I'll add it to the pull request. All the other future-returning methods in wgpu are either called before device creation, or return a future from map_async(), right?

> does this have to be integrated into wgpu-core at all?

I think there is one way to do it without wgpu-core. It would involve adding an Option<Arc<dyn Fn() + Send + Sync>> to either wgpu::Buffer or wgpu::backend::direct::Buffer, but 16 bytes per buffer instance won't hurt. Let me implement it, then I'll open an alternative pull request, and you can decide which version is better.

@kvark (Member) commented Sep 10, 2021

I guess I'm missing something conceptually. Why does there need to be a callback from wgpu-core at all? wgpu-core can't kick off events by itself; it only does so via maintain(). So "userspace" (from wgpu-core's point of view, e.g. wgpu-rs) can just call maintain(); what benefit does another callback have?

@xq-tec (Author) commented Sep 10, 2021

> Why does there need to be a callback from wgpu-core at all?

Because wgpu::Buffer has no reference to the wgpu::Device it belongs to, so when you call wgpu::BufferSlice::map_async(), there's no way to call a closure stored in wgpu::Device. Only at the wgpu-core level is there a link from the buffer to the device, so the buffer can trigger the closure stored in wgpu_core::device::Device.

But I think there's a way around this: If we store the callback closure in wgpu::Device and Arc::clone() the closure for every created wgpu::Buffer, then wgpu::BufferSlice::map_async() could directly call the closure, without touching wgpu-core at all. It might even be sufficient to put the closure into wgpu::backend::direct::Device and wgpu::backend::direct::Buffer, since the web backend doesn't need to trigger any polling.
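
Sketched out (field names here are hypothetical, and map_async() is simplified onto Buffer for brevity; the real method lives on BufferSlice):

use std::sync::Arc;

type PollTrigger = Arc<dyn Fn() + Send + Sync>;

struct Device {
    // ...
    poll_trigger: Option<PollTrigger>,
}

struct Buffer {
    // Cloned from the owning Device at creation; one Option<Arc<_>>,
    // i.e. ~16 bytes per buffer.
    poll_trigger: Option<PollTrigger>,
}

impl Buffer {
    fn map_async(&self /* , mode, range, callback */) {
        // ... register the mapping with wgpu-core as before ...
        if let Some(trigger) = &self.poll_trigger {
            trigger(); // wake the poller without touching wgpu-core
        }
    }
}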

@kvark (Member) commented Sep 10, 2021

ok, why do we need to call any closures on map_async?

@xq-tec (Author) commented Sep 10, 2021

The closure wakes up the background thread, which then executes the equivalent of Device::poll(Maintain::Wait).

@kvark (Member) commented Sep 10, 2021

I'm quite confused. So the background thread parks itself and waits, only getting woken up on that signal. When is this signal sent? On map_async the signal is useless, because that is not when the mapping is resolved; it's when it's requested. I.e., the buffer may still be in use by the GPU at that point, so Poll would do nothing.

@xq-tec (Author) commented Sep 10, 2021

When the background thread is woken up, it executes global.device_poll(device_id, true) – note the value true for the force_wait parameter – which is equivalent to Device::poll(Maintain::Wait). So the device_poll() call blocks until the device is idle, driving all futures to resolution, including the one which was created for the map_async() call which woke up the background thread.

@kvark (Member) commented Sep 10, 2021

I see. So, if wgpu::Buffer could call it, you wouldn't even need any callback mechanism? I don't think there is any strong reason that wgpu::Buffer couldn't keep a reference to the device. It's very much doable.

There is a problem, however. If mapping a buffer always blocks the device (using the wait = true parameter), then nothing else can make progress on the device. So this code becomes problematic:

buffer.slice(..).map_async(...);
// these operations are going to be blocked, since the device is busy waiting:
let texture = device.create_texture(...);
queue.submit(...);

So we end up in a strange situation: on one hand, there is a background thread that's meant to provide asynchrony, but on the other hand we are still effectively blocked on the main thread.

@xq-tec (Author) commented Sep 10, 2021

Wait, so when device.poll(Maintain::Wait) is called anywhere on any thread, all methods called on that device from other threads will block until the device is idle?
Well, shit, that would completely invalidate any background thread approach... 😞

Is there a way around this? I.e., would it be possible to wait on the "most recent fence" from one thread, but allow pushing to the queue on other threads?

@kvark (Member) commented Sep 10, 2021

I'm double-checking now. Maintain() locks the following:

  • the device hub, for reading. This means you can't asynchronously do any of the following (search for devices.write to see for yourself):
    • destroy buffers, textures (destroy != drop), or command encoders
    • unmap buffers
    • do any queue operations (submit, write buffer/texture)
  • the device's lifetime tracker. This means you can't do:
    • queue submission
    • creation of buffers that are mapped at creation
    • dropping of any resources
    • possibly other things...

I know this is highly unfortunate. The hub locking was never meant to be long-term, and device polling with wait is a hack that probably needs to be removed entirely.
Also, with @pythonesque's changes these lists may be reduced significantly, since there is going to be no Hub.

But as it stands now, the set of restrictions is pretty darn big, and it's very likely that the user code steps on one of them.

@xq-tec (Author) commented Sep 10, 2021

Thanks for the explanation, @kvark! I guess I'll wait until @pythonesque's changes land before working on this problem again.

@kvark changed the title from "Background thread for automatic device polling" to "[WIP] Background thread for automatic device polling" on Sep 10, 2021
@xq-tec (Author) commented Sep 14, 2021

Would it be possible to split wgpu_core's Device::maintain(), so that the blocking wait is done outside the locks?
Roughly like this:

// Acquire the locks and do the pre-wait triage:
let (device_guard, mut token) = hub.devices.read(&mut token);
let mut life_tracker = self.lock_life(token);
self.triage_suspected(...);
self.triage_mapped(...);
// Drop the locks before blocking:
drop(life_tracker);
drop(device_guard);
// Blocking wait, with no locks held:
self.raw.wait(...);
// Re-acquire the locks and finish up:
let (device_guard, mut token) = hub.devices.read(&mut token);
let mut life_tracker = self.lock_life(token);
let closures = self.triage_submissions(...);
self.handle_mapping(...);
return closures;

This would solve the loss of asynchrony due to device.poll(), but I'm not familiar enough with wgpu to tell whether temporarily releasing the locks would lead to incorrect behavior in triage_submissions() and handle_mapping().

@kvark (Member) commented Sep 14, 2021

> self.raw.wait(...)

This raw is inside Device, which needs to be locked, so it's not trivial.

Although technically we could make wgpu-hal expose some way of producing an independent "Waiter" object that one could use.
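
Such a "Waiter" might look roughly like this (purely illustrative; no such API exists in wgpu-hal today):

/// Hypothetical handle split off from the HAL device that can block on
/// fence progress without holding any device or hub locks.
pub trait Waiter: Send + Sync {
    /// Blocks until the fence reaches `value` or `timeout_ms` elapses;
    /// returns true if the value was reached.
    fn wait(&self, value: u64, timeout_ms: u32) -> bool;
}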

Ideally, we'd get @pythonesque's changes instead.

This pull request was closed.