
[WIP] Background thread for automatic device polling #1891

Closed · wants to merge 3 commits into from

Conversation

@xq-tec commented Aug 31, 2021

Connections
See #1871, which discusses the general problem and solution.

Description
This PR spawns an optional background thread that polls the device when a buffer is mapped. This enables more idiomatic code:

let buffer_slice = buffer.slice(..);
if let Ok(()) = buffer_slice.map_async(...).await {
    // ...
}

instead of

let buffer_slice = buffer.slice(..);
let fut = buffer_slice.map_async(...);
device.poll(wgpu::Maintain::Wait);
if let Ok(()) = fut.await {
    // ...
}

The implementation adds an optional closure to wgpu_core::device::Device that is invoked whenever a buffer mapping is set up. When a wgpu::Device is created, this closure can be set to a function that triggers a device poll on a background thread.
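
As a rough sketch (all names below are illustrative, not the PR's actual identifiers), the hook could look like this:

use std::sync::Arc;

pub struct Device {
    // ... existing wgpu-core device fields ...
    /// Optional hook invoked whenever a buffer mapping is requested, so an
    /// external integration (e.g. a background thread) can schedule a poll.
    /// `None` disables the mechanism entirely.
    map_callback: Option<Arc<dyn Fn() + Send + Sync>>,
}

impl Device {
    /// Called from the buffer-mapping setup path.
    fn notify_map_requested(&self) {
        if let Some(cb) = &self.map_callback {
            cb(); // e.g. wake the polling thread
        }
    }
}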

@xq-tec (Author) commented Sep 2, 2021

I've fixed the compile error in player, and the problems with the lints (I hope).

@pythonesque (Contributor) commented Sep 3, 2021

FWIW, as part of my upcoming changes, we'll be able to run maintain in multiple threads at once for the same queue as well as run it independently per-queue, which I think would conflict somewhat with your change.

IMO something like this is not really a substitute for a well-integrated runtime; wgpu should ideally provide some sort of runtime integration rather than just spawning a background thread like this. (Plus, in many applications, spawning background threads can be quite detrimental to performance, especially if they make heavy use of thread locals, so I'm worried about integrating this too deeply into wgpu.) For an example of what I mean, look at how flexible the rayon/crossbeam integrations are, where the user has total control over the thread pool, the workers, etc.

@xq-tec (Author) commented Sep 3, 2021

> FWIW, as part of my upcoming changes, we'll be able to run maintain in multiple threads at once for the same queue as well as run it independently per-queue, which I think would conflict somewhat with your change.

Can you link your WIP? Then I could have a look.

> IMO something like this is not really a substitute for a well-integrated runtime; wgpu should ideally provide some sort of runtime integration

How would this work? The fundamental problem is that there's no way to have the GPU driver call a function when some operation completes, so the application itself needs to block on or poll a fence and then resolve the futures. The best and most efficient place to do this is a background thread.

> rather than just spawning a background thread like this. (Plus, in many applications, spawning background threads can be quite detrimental to performance, especially if they make heavy use of thread locals, so I'm worried about integrating this too deeply into wgpu.)

Well, first, wgpu and its dependencies already spawn a lot of background threads before doing anything useful (~15 for the GL backend and ~25 for the Vulkan backend, on my system), so one more thread can't possibly hurt much.

Second, the background thread in this PR will probably never use more than its initial page of stack memory, and it is almost always idle, except for two points in time: when it is woken up and starts blocking on the device, and when the device finishes its work and the thread wakes up the blocked futures. The resources consumed by this thread are absolutely negligible compared to anything a non-trivial wgpu application will use.
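
For illustration, the thread's lifecycle could be sketched like this (the condvar-based wakeup is an assumption for the sketch; the PR's actual synchronization may differ):

use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Spawns the polling thread; the returned pair is signalled by the
/// map_async() hook to wake the thread up.
fn spawn_poll_thread(device: Arc<wgpu::Device>) -> Arc<(Mutex<bool>, Condvar)> {
    let signal = Arc::new((Mutex::new(false), Condvar::new()));
    let thread_signal = Arc::clone(&signal);
    thread::spawn(move || loop {
        let (lock, cvar) = &*thread_signal;
        // Sleep, consuming no CPU, until a mapping request signals us.
        let mut pending = lock.lock().unwrap();
        while !*pending {
            pending = cvar.wait(pending).unwrap();
        }
        *pending = false;
        drop(pending);
        // Block until the device is idle; this resolves the pending
        // mapping futures.
        device.poll(wgpu::Maintain::Wait);
    });
    signal
}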

Third, it's not deeply integrated into wgpu at all: the creation of the background thread is completely contained within the wgpu crate. The wgpu-core crate only contains the callback closure of the form Option<Arc<dyn Fn() + Send + Sync>>, which is very flexible and lends itself well to other ways of polling the device.

> For an example of what I mean, look at how flexible the rayon/crossbeam integrations are, where the user has total control over the thread pool, the workers, etc.

Applications that want and/or need total control over everything won't use wgpu anyway; they'll use Vulkan or DirectX directly. And even if they do use wgpu, it's very easy to opt out of creating the background thread.

@xq-tec (Author) commented Sep 9, 2021

So, any updates on this? Yes/no/maybe?

@kvark (Member) commented Sep 9, 2021

Sorry about the delay! I was hoping that @grovesNL can make a call on this.

@grovesNL (Collaborator) left a comment

This seems like a great start for a basic runtime, and it allows futures to work in roughly the same way across web and native whenever auto-poll is enabled. I think it would be good to proceed with this.

As @xq-tec mentioned, it's easy to opt out of this and manually control polling in the same way it currently works. We could also add direct integration with runtimes later on if we'd like (e.g. reusing an existing thread pool provided by another crate), or more detailed thread control.

@kvark (Member) left a comment

A few high-level questions that I find critical:

  1. why is this limited to buffer mapping? There are other async operations, like on_submitted_work_done.
  2. does this have to be integrated into wgpu-core at all? Can the solution live exclusively in wgpu-rs instead?
  3. in line with @pythonesque's reasoning, would this API be compatible with other runtimes, like winit's event loop?

@xq-tec (Author) commented Sep 10, 2021

Thanks, @kvark, @grovesNL!

> why is this limited to buffer mapping? There are other async operations, like on_submitted_work_done.

Because I didn't think of it, to be honest. It should be easy to support Queue::on_submitted_work_done(); I'll add it to the pull request. All the other future-returning methods in wgpu are either called before device creation, or return a future from map_async(), right?

> does this have to be integrated into wgpu-core at all?

I think there is one way to do it without wgpu-core. It would involve adding an Option<Arc<dyn Fn() + Send + Sync>> to either wgpu::Buffer or wgpu::backend::direct::Buffer, but 16 bytes per buffer instance won't hurt. Let me implement it, then I'll open an alternative pull request, and you can decide which version is better.

@kvark (Member) commented Sep 10, 2021

I guess I'm missing something conceptually. Why does there need to be a callback from wgpu-core at all? wgpu-core can't kick off events by itself; it only does so via maintain(). So "userspace" (from wgpu-core's point of view, e.g. wgpu-rs) can just call maintain(); what benefit does another callback have?

@xq-tec (Author) commented Sep 10, 2021

> Why does there need to be a callback from wgpu-core at all?

Because wgpu::Buffer has no reference to the wgpu::Device it belongs to, so when you call wgpu::BufferSlice::map_async(), there's no way to call a closure stored in wgpu::Device. Only at the wgpu-core level is there a link from the buffer to the device, so the buffer can trigger the closure stored in wgpu_core::device::Device.

But I think there's a way around this: If we store the callback closure in wgpu::Device and Arc::clone() the closure for every created wgpu::Buffer, then wgpu::BufferSlice::map_async() could directly call the closure, without touching wgpu-core at all. It might even be sufficient to put the closure into wgpu::backend::direct::Device and wgpu::backend::direct::Buffer, since the web backend doesn't need to trigger any polling.
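
Sketched out (field names here are hypothetical, and map_async() is simplified onto Buffer for brevity; the real method lives on BufferSlice):

use std::sync::Arc;

type PollTrigger = Arc<dyn Fn() + Send + Sync>;

struct Device {
    // ...
    poll_trigger: Option<PollTrigger>,
}

struct Buffer {
    // Cloned from the owning Device at creation; one Option<Arc<_>>,
    // i.e. ~16 bytes per buffer.
    poll_trigger: Option<PollTrigger>,
}

impl Buffer {
    fn map_async(&self /* , mode, range, callback */) {
        // ... register the mapping with wgpu-core as before ...
        if let Some(trigger) = &self.poll_trigger {
            trigger(); // wake the poller without touching wgpu-core
        }
    }
}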

@kvark (Member) commented Sep 10, 2021

ok, why do we need to call any closures on map_async?

@xq-tec (Author) commented Sep 10, 2021

The closure wakes up the background thread, which then executes the equivalent of Device::poll(Maintain::Wait).

@kvark (Member) commented Sep 10, 2021

I'm quite confused. So the background thread parks itself and waits, only getting woken up on that signal. When is this signal sent? On map_async the signal is useless, because that is not when the mapping is resolved; it's when it's requested. I.e., the buffer may still be in use by the GPU at that point, so Poll would do nothing.

@xq-tec (Author) commented Sep 10, 2021

When the background thread is woken up, it executes global.device_poll(device_id, true) – note the value true for the force_wait parameter – which is equivalent to Device::poll(Maintain::Wait). So the device_poll() call blocks until the device is idle, driving all futures to resolution, including the one which was created for the map_async() call which woke up the background thread.

@kvark (Member) commented Sep 10, 2021

I see. So, if wgpu::Buffer could call it, you wouldn't even need any callback mechanism? I don't think there is any strong reason that wgpu::Buffer couldn't keep a reference to the device. It's very much doable.

There is a problem, however. If mapping a buffer always blocks the device (using the wait = true parameter), then nothing else can make progress on the device. So this code becomes problematic:

buffer.slice(..).map_async(...);
// these operations are going to be blocked, since the device is busy waiting:
let texture = device.create_texture(...);
queue.submit(...);

So we end up in a strange situation: on one hand, there is a background thread that's meant to provide asynchrony, but on the other hand we are still effectively blocked on the main thread.

@xq-tec (Author) commented Sep 10, 2021

Wait, so when device.poll(Maintain::Wait) is called anywhere on any thread, all methods called on that device from other threads will block until the device is idle?
Well, shit, that would completely invalidate any background thread approach... 😞

Is there a way around this? I.e., would it be possible to wait on the "most recent fence" from one thread, but allow pushing to the queue on other threads?

@kvark (Member) commented Sep 10, 2021

I'm double-checking now. Maintain() locks the following:

  • the device hub, for reading. This means you can't asynchronously do any of the following (search for devices.write to see for yourself):
    • destroy buffers, textures (destroy != drop), or command encoders
    • unmap buffers
    • do any queue operations (submit, write buffer/texture)
  • the device's lifetime tracker. This means you can't do:
    • queue submission
    • creation of buffers that are mapped at creation
    • dropping of any resources
    • possibly other things...

I know this is highly unfortunate. The hub locking was never meant to be long-term, and device polling with wait is a hack that probably needs to be removed entirely.
Also, with @pythonesque's changes these lists may be reduced significantly, since there is going to be no Hub.

But as it stands now, the set of restrictions is pretty darn big, and it's very likely that the user code steps on one of them.

@xq-tec (Author) commented Sep 10, 2021

Thanks for the explanation, @kvark! I guess I'll wait until @pythonesque's changes land before working on this problem again.

@kvark changed the title from "Background thread for automatic device polling" to "[WIP] Background thread for automatic device polling" on Sep 10, 2021
@xq-tec (Author) commented Sep 14, 2021

Would it be possible to split wgpu_core's Device::maintain(), so that the blocking wait is done outside the locks?
Roughly like this:

// Acquire the locks and do the pre-wait triage:
let (device_guard, mut token) = hub.devices.read(&mut token);
let mut life_tracker = self.lock_life(token);
self.triage_suspected(...);
self.triage_mapped(...);
// Drop the locks before blocking:
drop(life_tracker);
drop(device_guard);
// Blocking wait, with no locks held:
self.raw.wait(...);
// Re-acquire the locks and finish up:
let (device_guard, mut token) = hub.devices.read(&mut token);
let mut life_tracker = self.lock_life(token);
let closures = self.triage_submissions(...);
self.handle_mapping(...);
return closures;

This would solve the loss of asynchrony due to device.poll(), but I'm not familiar enough with wgpu to tell whether temporarily releasing the locks would lead to incorrect behavior in triage_submissions() and handle_mapping().

@kvark (Member) commented Sep 14, 2021

> self.raw.wait(...)

This raw is inside Device, which needs to be locked, so it's not trivial.

Although technically we could make wgpu-hal expose some way of producing an independent "Waiter" object that one could use.
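
Such a "Waiter" might look roughly like this (purely illustrative; no such API exists in wgpu-hal today):

/// Hypothetical handle split off from the HAL device that can block on
/// fence progress without holding any device or hub locks.
pub trait Waiter: Send + Sync {
    /// Blocks until the fence reaches `value` or `timeout_ms` elapses;
    /// returns true if the value was reached.
    fn wait(&self, value: u64, timeout_ms: u32) -> bool;
}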

Ideally, we'd get @pythonesque's changes instead.

This pull request was closed.