WebGPURenderer: Workgroup Arrays and Barrier Support #29192
Conversation
I'm also not really a fan of the ScopedArrayNode name. Willing to take any suggestions on a name that would make more sense (ComputeArrayNode, ComputeLocalArrayNode, ComputeAccess, etc.).
I’m not a big fan of using “Scope” in a node name either. How about, for example:

```js
export const workgroupArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Workgroup', type, count ) );
export const privateArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Private', type, count ) );
```

By the way, are you planning on trying your WebGPU SPH Simulation with TSL @cmhhelgeson? 😄
Naming: I'll change the name to WorkgroupInfoNode. I'll also remove privateArray for now; I don't really see its utility when WGSLNodeBuilder already constructs all code within one function body. However, I'll leave the scope property in case there are other potential workgroup-local variable types (var, etc.). Maybe down the line we can decide whether to rename the class if we create a separate class for workgroup variables holding a single value.

Future plans: my current plan is:

So, TL;DR: yes 😊
Would it be complicated to have an example using these features in this PR?

Shouldn't be too complicated; I can write one using invocationLocalIndex.
Moved to draft until samples are created.
@sunag The WebGPUBackend side of the storage buffer sample is now fixed with the addition of a single workgroupBarrier() call, which prevents the data from being read and written at the same time. This is separate from the addition of a new sample that will complete this pull request.

three.js.examples.-.Google.Chrome.2024-08-27.15-57-26.mp4
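A minimal sketch of the barrier usage described here, assuming the TSL exports added in this PR; the import paths, buffer setup, and names below are placeholders for illustration, not the example's actual code:

```js
import { Fn, float, instanceIndex, storage, workgroupBarrier } from 'three/tsl';
import { StorageInstancedBufferAttribute } from 'three/webgpu';

// Placeholder buffer and size; the actual example uses its own data.
const count = 1024;
const storageBuffer = storage( new StorageInstancedBufferAttribute( new Float32Array( count ), 1 ), 'float', count );

const computeNode = Fn( () => {

	// Each invocation writes its own element.
	storageBuffer.element( instanceIndex ).assign( float( instanceIndex ) );

	// Every invocation in the workgroup waits here until all of them have
	// reached this point, so the writes above have finished before any code
	// after the barrier runs.
	workgroupBarrier();

	// Work that must not overlap the write phase goes here.
	storageBuffer.element( instanceIndex ).mulAssign( 2 );

} )().compute( count );
```

The node would then be dispatched as usual, e.g. with renderer.computeAsync( computeNode ).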
Just wanted to give a brief update since this took longer than originally expected. Two things have happened:
Sort is now working:

Untitled.video.7.mp4
@cmhhelgeson Looks great! I will review and merge it soon, thanks.
Right now the example mixes local (workgroup) and global swaps, but it might be worth considering completing all the local sorting first before moving on to global sorting. This could better reflect how bitonic sort is typically optimized for parallel processing:

Phase 1: Perform all local (workgroup) swaps (flip and disperse) within each group.
Phase 2: Move on to the global swaps.

This approach could make the example easier to understand by clearly separating the local and global phases, making the sorting process more educational. Just a thought!
Maybe I'm misunderstanding your suggestion, but the example already does this. It will perform only local swaps until the span of a swap necessitates that the swap be performed globally. The purpose of the computeAlgo function is to ensure that the correct swap function is executed given the span length; it's not mixing global and local swaps at random.

EDIT: For instance, in the debug panel of the reference implementation, whose code I've ported into TSL, Next Step will always be a Local step until the Next Swap Span exceeds workgroup_size * 2: https://webgpu.github.io/webgpu-samples/sample/bitonicSort
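In other words, the step selection boils down to something like the following sketch; WORKGROUP_SIZE and the return values are illustrative, not the example's exact code:

```js
// Rough sketch: choose a local or global swap for a bitonic sort step based on
// the span of elements each comparison covers.
const WORKGROUP_SIZE = 64;

function nextStepKind( swapSpan ) {

	// Each invocation compares two elements, so one workgroup can cover a span
	// of up to WORKGROUP_SIZE * 2 elements entirely in workgroup-local memory.
	if ( swapSpan <= WORKGROUP_SIZE * 2 ) {

		return 'local'; // flip/disperse using the workgroup array

	}

	return 'global'; // flip/disperse through the storage buffer

}
```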
Oh I see, the two panels were misleading me. Then all good! I guess having just the left example could be easier to understand, since the local sort already switches to global swaps when needed, demonstrating both features.
Maybe we could color code local and global swaps somehow.

EDIT: There are probably further performance improvements that could come down the line with the implementation of switch statements, but for now, I'll try to find a way to indicate the locality of a sort in an interesting visual manner before closing this out.
@RenaudRohlinger The example has been updated to more clearly demonstrate when a local sort or a global sort is occurring.
Probably the best example of a sorting algorithm I've ever seen! 😄

Is there anything else that needs to be done here?
* init
* barrier, private array, workgroup array support
* clean
* Implement Renaud suggestions
* fix
* fix storage buffer example with workgroupBarrier()
* add tags and other info
* add bitonic sort example
* update
* Rebase branch
* try to fix bitonic sort shader
* simplify
* fix
* bitonic sort now works but local swap is slower than global swap :
* cleanup
* fix rebase issues
* Change display and html to make difference between global and local swap clearer. May want to improve the performance of the fragment shader by writing nextAlgo and nextBlockHeight to uniforms on the CPU side
* update (ugly?) screenshot
* cleanup
Description
Add the ability to create workgroup and private arrays within compute shaders, which can be used to accelerate compute operations. Ideally this could be used to provide pre-written compute operations that are fast and useful out of the box (bitonic sort, prefix sum). This would probably be less useful to the end user, though those targeting only WebGPU devices may find some benefit in using this functionality.
If requested, I can try to provide samples for some of this functionality, like porting the existing WebGPU Bitonic sort sample or doing something with spatial hashing and prefix sums, though this will likely require the ability to query the value of local_invocation_id and workgroup_id within TSL.
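A rough sketch of the intended usage, assuming the workgroupArray() and workgroupBarrier() exports from this PR; invocationLocalIndex, the import paths, and the buffer/count setup are assumptions for illustration, not finalized API:

```js
import { Fn, instanceIndex, invocationLocalIndex, storage, workgroupArray, workgroupBarrier } from 'three/tsl';
import { StorageInstancedBufferAttribute } from 'three/webgpu';

// Placeholder global buffer holding the values to process.
const elementCount = 1024;
const dataStorage = storage( new StorageInstancedBufferAttribute( new Uint32Array( elementCount ), 1 ), 'uint', elementCount );

// Workgroup-shared scratch memory: one slot per invocation of a 64-wide workgroup.
const tile = workgroupArray( 'uint', 64 );

const computeNode = Fn( () => {

	// Stage this invocation's element into fast workgroup memory.
	tile.element( invocationLocalIndex ).assign( dataStorage.element( instanceIndex ) );

	// Make every invocation's staged value visible to the whole workgroup
	// before anyone reads a neighbouring slot.
	workgroupBarrier();

	// ...compare/swap against neighbouring slots of tile here (bitonic sort,
	// prefix sum, etc.)...

	// Write the result back to global storage.
	dataStorage.element( instanceIndex ).assign( tile.element( invocationLocalIndex ) );

} )().compute( elementCount );
```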