WebGPURenderer: Workgroup Arrays and Barrier Support #29192

Merged
merged 19 commits into from
Sep 10, 2024


Contributor

@cmhhelgeson cmhhelgeson commented Aug 20, 2024

Description

Adds the ability to create workgroup and private arrays within compute shaders, which can be used to accelerate compute operations. Ideally, this could be used to provide pre-written compute operations that are fast and useful out of the box (bitonic sort, prefix sum). It will probably be less useful to the typical end user, though those targeting only WebGPU devices may benefit from using this functionality directly.

If requested, I can try to provide samples for some of this functionality, like porting the existing WebGPU Bitonic sort sample or doing something with spatial hashing and prefix sums, though this will likely require the ability to query the value of local_invocation_id and workgroup_id within TSL.

  • Node implementation
  • Storage Buffer Sample Fix
  • Bespoke Workgroup Array Sample
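
To illustrate the kind of operation the description has in mind, here is a minimal CPU-side sketch of a prefix sum built around a shared scratch array. It is plain JavaScript rather than TSL or WGSL, and the name `workgroupPrefixSum` is hypothetical. The `shared` array stands in for a workgroup array, and the end of each pass marks where a workgroup barrier would sit on the GPU. It uses the Hillis-Steele scan, which is one common choice and not necessarily what a three.js implementation would use.

```javascript
// CPU-side simulation of a workgroup-local inclusive prefix sum (Hillis-Steele scan).
// `shared` stands in for a workgroup array; this is an illustration, not three.js code.
function workgroupPrefixSum( input ) {

	const n = input.length;
	let shared = Array.from( input ); // copy "global" data into "workgroup" memory

	for ( let offset = 1; offset < n; offset *= 2 ) {

		// Every invocation reads from the previous pass and writes to a fresh buffer,
		// so no invocation observes a half-updated value.
		const next = shared.slice();

		for ( let localId = offset; localId < n; localId ++ ) {

			next[ localId ] = shared[ localId ] + shared[ localId - offset ];

		}

		shared = next; // barrier point: all writes land before the next pass reads

	}

	return shared;

}

console.log( workgroupPrefixSum( [ 3, 1, 4, 1, 5, 9, 2, 6 ] ) );
// [ 3, 4, 8, 9, 14, 23, 25, 31 ]
```

The scan finishes in log2(n) passes, and each pass only touches the shared scratch array. Keeping that traffic in workgroup memory is where the speedup over repeated global reads would be expected to come from.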


github-actions bot commented Aug 20, 2024

📦 Bundle size

Full ESM build, minified and gzipped.

              Before (kB)        After (kB)         Diff
              min / gzip         min / gzip         min / gzip
WebGL         685.24 / 169.64    685.24 / 169.64    +0 B / +0 B
WebGPU        826.41 / 221.63    827.96 / 222.09    +1.55 kB / +466 B
WebGPU Nodes  825.99 / 221.54    827.54 / 222       +1.55 kB / +461 B

🌳 Bundle size after tree-shaking

Minimal build including a renderer, camera, empty scene, and dependencies.

              Before (kB)        After (kB)         Diff
              min / gzip         min / gzip         min / gzip
WebGL         462.02 / 111.48    462.02 / 111.48    +0 B / +0 B
WebGPU        525.5 / 141.64     526.17 / 141.84    +671 B / +193 B
WebGPU Nodes  482.15 / 131.46    482.83 / 131.65    +671 B / +192 B

@cmhhelgeson
Contributor Author

I'm also not really a fan of the ScopedArrayNode name. Willing to take any suggestions on a name that would make more sense (ComputeArrayNode, ComputeLocalArrayNode, ComputeAccess, etc.).

@RenaudRohlinger
Collaborator

I’m not a big fan of using "Scope" in a node name either. How about WorkgroupInfoNode? It clearly indicates workgroup-level data and aligns with WebGPU and WGSL terminology. Or maybe DomainArrayNode?

For example:

export const workgroupArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Workgroup', type, count ) );
export const privateArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Private', type, count ) );

By the way are you planning on trying your WebGPU SPH Simulation with TSL @cmhhelgeson? 😄

@cmhhelgeson
Contributor Author

cmhhelgeson commented Aug 22, 2024

> I’m not a big fan of using "Scope" in a node name either. How about WorkgroupInfoNode? It clearly indicates workgroup-level data and aligns with WebGPU and WGSL terminology. Or maybe DomainArrayNode?
>
> For example:
>
> export const workgroupArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Workgroup', type, count ) );
> export const privateArray = ( type, count ) => nodeObject( new WorkgroupInfoNode( 'Private', type, count ) );
>
> By the way are you planning on trying your WebGPU SPH Simulation with TSL @cmhhelgeson? 😄

Naming:

I'll change the name to WorkgroupInfoNode. I'll also remove privateArray for now. I don't really see its utility when WGSLNodeBuilder already constructs all code within one function body. However, I'll leave the scope property just in case there are other potential workgroup-local variable types ( var, etc ). Maybe down the line, we can decide whether we want to rename the class if we create a separate class for workgroup variables holding a single value.
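
As a rough illustration of why a scope property can be worth keeping, the sketch below shows how a scope string might map onto a WGSL address space when a declaration is generated. `ScopedArraySketch`, its `generate` method, and `workgroupArraySketch` are hypothetical names made up for this example. They are not the three.js WorkgroupInfoNode implementation, whose internals are not shown in this thread.

```javascript
// Hypothetical stand-in for a scoped array node; NOT the three.js implementation.
class ScopedArraySketch {

	constructor( scope, elementType, count ) {

		this.scope = scope; // 'Workgroup' or 'Private'
		this.elementType = elementType; // WGSL element type, e.g. 'u32'
		this.count = count; // fixed element count

	}

	// Build a WGSL module-scope declaration for this array.
	generate( name ) {

		const addressSpace = this.scope.toLowerCase();

		return `var<${ addressSpace }> ${ name }: array<${ this.elementType }, ${ this.count }>;`;

	}

}

// Mirrors the shape of the factory proposed above, minus nodeObject().
const workgroupArraySketch = ( type, count ) => new ScopedArraySketch( 'Workgroup', type, count );

console.log( workgroupArraySketch( 'u32', 64 ).generate( 'sharedData' ) );
// var<workgroup> sharedData: array<u32, 64>;
```

Under this assumption, supporting another scope later would only need a different scope string, not a new class.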

Future Plans

My current plan is:

  1. Get current pull requests merged in
  2. Finish protoplanet port ( people seem to like ports of old samples and InstancedPointsNodeMaterial needs to be fixed)
  3. Finish extant pull requests that can be finished (post-processing, arrayCamera, etc)
  4. Do required reading on subgroups, physics, workgroup sync, maybe atomics for a week to get myself back up to speed.
  5. Move onto new/more creative/more demanding uses of compute like SPH, Instanced Points FLIP, Spatial Hash Collisions with Workgroup or Subgroup Sync, etc.

So TLDR: Yes 😊

@sunag
Collaborator

sunag commented Aug 24, 2024

Would it be complicated to have an example using these features in this PR?

@cmhhelgeson
Contributor Author

> Would it be complicated to have an example using these features in this PR?

Shouldn't be too complicated, I can write one using invocationLocalIndex.

@cmhhelgeson cmhhelgeson marked this pull request as draft August 26, 2024 16:39
@cmhhelgeson
Contributor Author

Moved to draft until samples are created.

@cmhhelgeson
Contributor Author

cmhhelgeson commented Aug 27, 2024

@sunag The WebGPUBackend side of the Storage buffer sample is now fixed with the addition of a single workgroupBarrier() call. This call prevents data from being accessed and written to at the same time. This is separate from the addition of a new sample that will complete this pull request.
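
The hazard a barrier guards against can be sketched on the CPU. The example below is a generic illustration and not the code of the storage buffer sample: the function name `reverseInWorkgroup` is hypothetical, and the sequential loop in the no-barrier branch is only one possible interleaving (on a GPU the order is unspecified). Each simulated invocation reverses a shared array by reading one slot and writing another.

```javascript
// Simulates the invocations of one workgroup reversing a shared array in place.
// withBarrier = true: every invocation finishes its read before any write happens.
// withBarrier = false: each invocation reads and writes immediately, so later
// invocations can read slots that earlier ones already overwrote.
function reverseInWorkgroup( data, withBarrier ) {

	const shared = Array.from( data );
	const n = shared.length;

	if ( withBarrier ) {

		const held = new Array( n ); // per-invocation private value

		// Phase 1: all invocations read.
		for ( let id = 0; id < n; id ++ ) held[ id ] = shared[ n - 1 - id ];

		// A workgroup barrier would sit here on the GPU.

		// Phase 2: all invocations write.
		for ( let id = 0; id < n; id ++ ) shared[ id ] = held[ id ];

	} else {

		// Reads and writes interleave: a data race on real hardware.
		for ( let id = 0; id < n; id ++ ) shared[ id ] = shared[ n - 1 - id ];

	}

	return shared;

}

console.log( reverseInWorkgroup( [ 1, 2, 3, 4 ], true ) ); // [ 4, 3, 2, 1 ]
console.log( reverseInWorkgroup( [ 1, 2, 3, 4 ], false ) ); // [ 4, 3, 3, 4 ]
```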

three.js.examples.-.Google.Chrome.2024-08-27.15-57-26.mp4

@cmhhelgeson
Contributor Author

cmhhelgeson commented Sep 3, 2024

Just wanted to give a brief update since this took longer than originally expected. Two things have happened:

  1. Moving for a new job, so haven't had the time to give PRs proper attention.
  2. Sort seems to work under certain conditions but there's some weirdness with the uniforms that I haven't been able to figure out yet. I'll add more detail below once I've figured out what the exact issue is.

@cmhhelgeson
Contributor Author

Sort is now working:

Untitled.video.7.mp4

@cmhhelgeson cmhhelgeson marked this pull request as ready for review September 4, 2024 06:02
@sunag sunag added this to the r169 milestone Sep 5, 2024
@sunag
Collaborator

sunag commented Sep 6, 2024

@cmhhelgeson Looks great! I will review and merge it soon, thanks

@RenaudRohlinger
Collaborator

Right now the example mixes local (workgroup) and global swaps, but it might be worth considering completing all the local sorting first before moving on to global sorting.

This could better reflect how bitonic sort is typically optimized for parallel processing:

Phase 1: Perform all local (workgroup) swaps (flip and disperse) within each group.
Phase 2: Once the local sorting is done, proceed to global sorting across workgroups to finalize the order.

This approach could make the example easier to understand by clearly separating the local and global phases, making the sorting process more educational. Just a thought!

@cmhhelgeson
Contributor Author

cmhhelgeson commented Sep 6, 2024

> Right now the example mixes local (workgroup) and global swaps, but it might be worth considering completing all the local sorting first before moving on to global sorting.
>
> This could better reflect how bitonic sort is typically optimized for parallel processing:
>
> Phase 1: Perform all local (workgroup) swaps (flip and disperse) within each group.
> Phase 2: Once the local sorting is done, proceed to global sorting across workgroups to finalize the order.
>
> This approach could make the example easier to understand by clearly separating the local and global phases, making the sorting process more educational. Just a thought!

Maybe I'm misunderstanding your suggestion, but the example already does this. It will perform only local swaps until the span of a swap necessitates that the swap be performed globally. The purpose of the computeAlgo function is to ensure that the correct swap function is executed given the span length. It's not mixing global and local swaps at random.

EDIT: For instance, in the debug panel of the reference implementation, whose code I've ported into TSL, Next Step will always be a Local step until the Next Swap Span exceeds workgroup_size * 2: https://webgpu.github.io/webgpu-samples/sample/bitonicSort
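
The local/global switch described above can be sketched on the CPU. This is a plain JavaScript illustration, not the TSL example from this PR: `bitonicSort`, `flip`, `disperse`, and `compareAndSwap` are names chosen for this sketch, the input length is assumed to be a power of two, and a step is labeled local whenever its span fits within `workgroupSize * 2` elements, following the rule stated in the comment.

```javascript
// Swap two elements if they are out of ascending order.
function compareAndSwap( data, a, b ) {

	if ( data[ a ] > data[ b ] ) {

		const tmp = data[ a ];
		data[ a ] = data[ b ];
		data[ b ] = tmp;

	}

}

// Flip: within each block of `span` elements, compare element i with its mirror.
function flip( data, span ) {

	for ( let start = 0; start < data.length; start += span ) {

		for ( let i = 0; i < span / 2; i ++ ) compareAndSwap( data, start + i, start + span - 1 - i );

	}

}

// Disperse: within each block of `span` elements, compare element i with i + span / 2.
function disperse( data, span ) {

	const half = span / 2;

	for ( let start = 0; start < data.length; start += span ) {

		for ( let i = 0; i < half; i ++ ) compareAndSwap( data, start + i, start + i + half );

	}

}

// Sorts a power-of-two-length array and records whether each step is local or global.
function bitonicSort( input, workgroupSize ) {

	const data = Array.from( input );
	const steps = [];

	const run = ( kind, span ) => {

		// Each invocation handles two elements, so one workgroup covers workgroupSize * 2.
		const locality = span <= workgroupSize * 2 ? 'local' : 'global';
		steps.push( `${ locality } ${ kind } ${ span }` );
		( kind === 'flip' ? flip : disperse )( data, span );

	};

	for ( let span = 2; span <= data.length; span *= 2 ) {

		run( 'flip', span );
		for ( let sub = span / 2; sub >= 2; sub /= 2 ) run( 'disperse', sub );

	}

	return { sorted: data, steps };

}

const result = bitonicSort( [ 7, 3, 6, 0, 5, 2, 4, 1 ], 2 );
console.log( result.sorted ); // [ 0, 1, 2, 3, 4, 5, 6, 7 ]
console.log( result.steps );
// [ 'local flip 2', 'local flip 4', 'local disperse 2',
//   'global flip 8', 'local disperse 4', 'local disperse 2' ]
```

With a workgroup size of 2, only the span-8 flip exceeds what a single workgroup covers, so it is the one global step. The disperse steps that follow it shrink back to spans a workgroup can handle.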

@RenaudRohlinger
Collaborator

Oh I see, the two panels were misleading me. Then all good! I guess having just the left example could be easier to understand? Since the local already swaps to global when needed, demonstrating both features.

@cmhhelgeson
Contributor Author

cmhhelgeson commented Sep 6, 2024

> Oh I see, the two panels were misleading me. Then all good! I guess having just the left example could be easier to understand? Since the local already swaps to global when needed, demonstrating both

Maybe we could color code local and global swaps somehow.

EDIT: There are probably further performance improvements that could come down the line with the implementation of switch statements, but for now, I'll try to determine a way to indicate the locality of a sort in an interesting visual manner before closing this out.

…wap clearer. May want to improve the performance of the fragment shader by writing nextAlgo and nextBlockHeight to uniforms on the CPU side
@cmhhelgeson
Contributor Author

cmhhelgeson commented Sep 8, 2024

@RenaudRohlinger The example has been updated to more clearly demonstrate when a local sort or when a global sort is occurring.

@RenaudRohlinger
Collaborator

Probably the best example of a sorting algorithm I've ever seen! 😄

@cmhhelgeson
Contributor Author

Is there anything else that needs to be done here?

@sunag sunag merged commit 1174d07 into mrdoob:dev Sep 10, 2024
12 checks passed
@cmhhelgeson cmhhelgeson deleted the workgroup_array_node branch September 10, 2024 18:38
LD2Studio pushed a commit to LD2Studio/LD2Studio-Editor that referenced this pull request Sep 13, 2024
* init

* barrier, private array, workgroup array support

* clean

* Implement Renaud suggestions

* fix

* fix storage buffer example with workgroupBarrier()

* add tags and other info

* add bitonic sort example

* update

* Rebase branch

* try to fix bitonic sort shader

* simplify

* fix

* bitonic sort now works but local swap is slower than global swap :

* cleanup

* fix rebase issues

* Change display and html to make difference between global and local swap clearer. May want to improve the performance of the fragment shader by writing nextAlgo and nextBlockHeight to uniforms on the CPU side

* update (ugly?) screenshot

* cleanup

---------

3 participants