Apparently miscompiled barrier in compute shader #3181

Open
raphlinus opened this issue Nov 6, 2022 · 4 comments
Labels
api: metal (Issues with Metal) · external: driver-bug (A driver is causing the bug, though we may still want to work around it)

Comments

@raphlinus (Contributor)
I'm investigating what appears to be a miscompiled shader from piet-gpu and have generated a very reduced test case. There's more detail at linebender/vello#199, but this issue is intended to be self-contained.

Description
My shader does bump allocation with atomics in one thread (local index `WG_SIZE - 1`), then stores the result in a workgroup-shared variable to broadcast it to the other threads. In the shader source, a `workgroupBarrier()` separates the write from the read. What I'm seeing is consistent with the barrier being elided altogether.
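
A minimal WGSL sketch of the pattern (the names, bindings, and workgroup size here are hypothetical stand-ins; the actual reduced shader lives in the linebender/vello#199 tree):

```wgsl
// Sketch of the failing pattern, not the actual reduced test case.
const WG_SIZE: u32 = 256u;

@group(0) @binding(0) var<storage, read_write> bump: atomic<u32>;
@group(0) @binding(1) var<storage, read_write> output: array<u32>;

var<workgroup> broadcast: u32;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_id) local_id: vec3<u32>) {
    // One thread bump-allocates and publishes the result to workgroup memory.
    if (local_id.x == WG_SIZE - 1u) {
        broadcast = atomicAdd(&bump, 1u);
    }
    // This barrier separates the write from the reads; the observed
    // behavior is consistent with it being elided.
    workgroupBarrier();
    // The other threads read the broadcast value.
    if (local_id.x < 3u) {
        output[local_id.x] = broadcast;
    }
}
```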

Repro steps
Check out the tree from linebender/vello#199, `cd piet-wgsl`, then `cargo run`.

Expected vs observed behavior
Correct output is as follows:

```
0: 1
1: 1
2: 1
```

The actual output appears to be garbage values (`0x4a150041` is typical), at least when run on an M1 Max Mac.

Extra materials
I also prepared a WebGPU version of this at https://gist.github.com/raphlinus/f32d9602241ea5c99b480e721f1ea834. It does not repro on Chrome Canary.

I've also looked at the naga output from compiling the shader and don't see anything wrong with that - the barrier is clearly there.

Some perturbations to the test make the behavior go away (for example, removing the last `if` gating the write), but in my experience that is common with barrier problems. Also, when I perturb the test so that the reader threads are in the same subgroup (or when the workgroup size is reduced to 32, which appears to be the subgroup size), the behavior goes away. And if I change the test to store the value in a storage buffer and use a `storageBarrier()`, the behavior goes away; this is the current workaround I'm using in piet-gpu itself (sketched below).
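
A sketch of that workaround, under the same hypothetical declarations as the sketch above: route the broadcast through a storage buffer and synchronize with a storage barrier instead.

```wgsl
// Hypothetical extra binding, used only to broadcast the allocated value
// through storage memory rather than workgroup memory.
@group(0) @binding(2) var<storage, read_write> broadcast_buf: u32;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_id) local_id: vec3<u32>) {
    if (local_id.x == WG_SIZE - 1u) {
        broadcast_buf = atomicAdd(&bump, 1u);
    }
    // storageBarrier() synchronizes storage-address-space accesses within
    // the workgroup; with it, the test produces the expected output.
    storageBarrier();
    if (local_id.x < 3u) {
        output[local_id.x] = broadcast_buf;
    }
}
```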

Platform
macOS 12.6, Apple M1 Max, wgpu 0.14

@raphlinus (Contributor, Author) commented Nov 6, 2022

More digging: I've made a minimal Metal repro that triggers the behavior. It triggers only when the buffer bounds-check policy is set to `ReadZeroSkipWrite`; otherwise it does not.
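
For context, here is the conceptual shape of that policy, rendered as WGSL-level pseudocode (the real transform is applied by naga in the generated MSL, not at the WGSL level):

```wgsl
// A write `output[i] = v;` becomes a conditional store ("skip write"):
if (i < arrayLength(&output)) {
    output[i] = v;
}

// A read `let x = output[i];` yields zero when out of bounds ("read zero"):
var x: u32 = 0u;
if (i < arrayLength(&output)) {
    x = output[i];
}
```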

@raphlinus (Contributor, Author)
Even more curious: if I replace the expression `1 + (_buffer_sizes.size1 - 0 - 4) / 4` with the hardcoded constant `256` (which it evaluates to), I don't see the behavior. If I simplify it to `_buffer_sizes.size1 / 4`, I still do.
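
A quick sanity check of the arithmetic, assuming `size1 == 1024` (inferred from the constant 256 above); the distinguishing factor appears to be the runtime load of `size1` versus a compile-time literal, not the arithmetic itself:

```wgsl
// 1 + (size1 - 0 - 4) / 4  ==  1 + 1020 / 4  ==  256   // triggers
// size1 / 4                ==  1024 / 4      ==  256   // still triggers
// 256                          (hardcoded literal)     // does not trigger
```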

@raphlinus (Contributor, Author)
And more insight into why it works in Chrome Canary: the Tint transform for buffer robustness uses `min` rather than conditionalizing the store. I have a tweak to the Metal repro case if anybody wants to see it, but I don't know that it would be particularly helpful in tracking this down.
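
Rendered in the same WGSL-level pseudocode as above, Tint's approach looks roughly like this:

```wgsl
// The store always executes, with the index clamped into bounds, so no
// new control flow is introduced around it (which may be why the barrier
// survives compilation there).
output[min(i, arrayLength(&output) - 1u)] = v;
```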

@raphlinus (Contributor, Author)
The miscompilation has been confirmed by a contact at Apple, and my understanding is that they're working on it. That still leaves the question of how to mitigate it on our end. Possibilities include:

  • Keep the bletcherous workaround of using a storage buffer rather than a workgroup-shared variable.
  • Turn off bounds checking altogether, since in this case it actually increases risk rather than reducing it.
  • Patch naga to use `min`-style bounds enforcement rather than conditionalizing the store.

I also want to make sure this case is adequately covered by tests.
