Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS #22753

Open
wants to merge 33 commits into
base: main
Choose a base branch
from

Conversation

cwoffenden
Copy link
Contributor

@cwoffenden cwoffenden commented Oct 16, 2024

This builds on #22741 just because that's where I was at, but it's not required. The interesting changes are in audio_worklet.js and I'd appreciate some feedback from @juj before tidying this up (with sanity checks and a fallback).

Since we pass in the stack for the worklet from the caller's heap, its address shouldn't change. And since the (I'll make myself say it) render quantum size doesn't change after the audio worklet creation, the stack positions for the audio buffers should not change either. So, we can create one-time subarray views and replace the float-by-float copy with a simple set() per channel (per output).

I've thrown simple tests at it at and it works, fulfilling the garbage-free requirement and theoretically having a nice performance boost (not measured, but looping over thousands of JS Number types and shuffling them to and from floats must come at a cost). If the outputList does change, then it should only change after changes to the audio chain, which would be expensive enough that changing the subarrays wouldn't make a difference.

To be extra sure, we can move the output buffers to the first entries on the stack, then simple additional changes like input buffers won't change the address.

It wants sanity checks here and there but I'd like feedback for anything I'm missing or misunderstanding. Thanks!

@cwoffenden cwoffenden changed the title [AUDIO_WORKET] Optimise the copy back from wasm's heap to JS [AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS Oct 16, 2024
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from 38203f0 to 24a90e4 Compare October 17, 2024 13:01
@juj
Copy link
Collaborator

juj commented Oct 17, 2024

Nice idea! It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?

src/audio_worklet.js Outdated Show resolved Hide resolved
src/audio_worklet.js Outdated Show resolved Hide resolved
@cwoffenden
Copy link
Contributor Author

Nice idea!

A shower thought, as all good ideas are!

It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?

Yes, this one was rebased off the earlier one so should have the same commit IDs.

src/audio_worklet.js Outdated Show resolved Hide resolved
@cwoffenden
Copy link
Contributor Author

cwoffenden commented Oct 21, 2024

EDIT: every test I tried suggests no, the stack is only used once we hit Wasm in the callback.

@juj or someone who knows more about the internals of this than me: on entering AudioWorkletProcessor.process() nothing else will have used the stack beforehand? I've verified stack top is always the same when we enter the AudioWorkletProcessor's contructor and process() and it's always the same as the the address (plus length) passed to emscripten_start_wasm_audio_worklet_thread_async().

To rephrase: unless the audio worklet explicitly uses the stack functions from JS, nothing external from Wasm will before callbackFunction() is called?

@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 3 times, most recently from dcdcea1 to 5b65dcf Compare October 26, 2024 10:06
@cwoffenden
Copy link
Contributor Author

Some notes: lots of experiments with the stack allocations, minimum sizes, various flags (STACK_OVERFLOW_CHECK and ASSERTIONS) and we have some very happy code! View allocations are all in the constructor, with sanity checks performed on the address.

Next is to benchmark it.

@cwoffenden
Copy link
Contributor Author

cwoffenden commented Oct 29, 2024

Benchmarks of the main part of the audio copy done:

https://wip.numfum.com/cw/2024-10-29/index.html

Testing on my M2 Mac Studio in Chrome and Safari this PR is around 15x faster on the float copy, e.g. the original being 0.625µs per process() and the new code 0.044µs. On my aging iPhone 11 it's 1.182µs before and 0.199µs after.

Firefox on Mac has the largest difference: 5.090µs before and 0.123µs after, a 41x speed-up. Updating Firefox fixes this and it's now 0.916µs before and 0.118µs after.

@juj if we're in agreement that the simplified standalone JavaScript test code is doing the right thing (it's short and a copy and paste), I can gather numbers from regular hardware (we have a wall of Chromebooks at work).

Next I'll need to create tests to show that this still works with various input and output configs.

EDIT: a 7-12x speed-up seems typical on x64 Windows or Linux.

@cwoffenden
Copy link
Contributor Author

cwoffenden commented Oct 30, 2024

Note to me: look at #22808

@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from 069f7a4 to ae0e8bf Compare October 31, 2024 22:49
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 6 times, most recently from f6153e9 to cccece4 Compare November 8, 2024 06:06
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from b3dc2ef to ac37140 Compare November 14, 2024 19:14
The code has also been brought back closer to the original for comparison.
The initial stackAlloc() is overflowing, seeming to need more space so this is accounted for.
Tested with various stack sizes, output sizes, and generators.
The assertions should now cover all cases of changes in address and size of the output views.
(Off home!)
Rough implementation to see what needs doing in JS vs Wasm.
The tests pass the audio context in a void* for convenience, which needs shortening/widening for 64-bit pointers.
@cwoffenden
Copy link
Contributor Author

I ran the loop test I wrote on some recent Chromebooks, and on a middle-of-the-road ARM model I'm seeing an 11x speed-up (e.g. 1.818µs before, and 0.160µs after). Same 7-12x range improvement on other devices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants