Bitmask lookup table based supersampling #64
Conversation
On the specific question of piet-gpu-hal, yes, it adds friction. I see three ways it can go:
The sampling work is certainly interesting, but I haven't dug into it in detail (and I'm struggling with time management right now). It's slightly discouraging that there's a slowdown, but it might still be acceptable if it solves the conflation artifacts. In any case, I'm very happy you're exploring this, as I think it's one of the more interesting questions.
piet-gpu/shader/kernel4.comp (outdated)

// mask ^= 0;
} else {
    vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
    uvec4 tex = texture(winding_lut, tex_coord);
Can this be replaced with imageLoads?
Texture sampling from compute programs is relatively slow on lower-end hardware, probably because of the unpredictable sampling coordinates. Also, compiling kernel4 for CPUs through something like SwiftShader is much harder with texture sampling. See my #63, which replaces texture(Grad) with imageLoad.
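For reference, here's a minimal sketch of what the imageLoad variant could look like. The image binding, format, `winding_lut_img`, and `LUT_SIZE` are illustrative assumptions, not the actual piet-gpu declarations; since integer textures can't be filtered anyway, snapping to the nearest texel loses nothing:

```glsl
// Hypothetical replacement: bind the LUT as a storage image and fetch it
// with imageLoad at explicitly computed integer coordinates.
layout(rgba32ui, binding = 4) uniform readonly uimage2D winding_lut_img;

const int LUT_SIZE = 16;  // assumed LUT dimension

uvec4 load_winding_mask(float c, vec2 n) {
    // Same coordinate mapping as the texture() path, snapped to a texel.
    vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
    ivec2 icoord = clamp(ivec2(tex_coord * float(LUT_SIZE)),
                         ivec2(0), ivec2(LUT_SIZE - 1));
    return imageLoad(winding_lut_img, icoord);
}
```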
Yeah, it surely makes sense to replace this with an image. Though, due to the piet-gpu-hal mess, this would mean partially applying #63, which I don't want to do right now. I'll probably make the switch when it's easy enough to do.
Here's a rough version of front-to-back drawing. It doesn't handle opacity yet, but the plan is to reorder objects in coarse raster so that transparent objects are drawn back to front after all opaque objects are drawn. In fine raster I plan to have a save/restore operation for stencil masks so that we can still be somewhat accurate when drawing those alpha objects. I also managed to get a profile. The workload has become mostly bound on the latency of LUT sampling, and ALU utilization is down to 55% (from 80%). Not particularly good or bad, but it's something to keep in mind.
Here's the version with back-to-front drawing for transparent objects. It works by using something like a gap buffer in coarse raster (see the sketch below). The way it's currently implemented is rather ugly, but I think it's the least complicated data structure that achieves the purpose. I think I'm going to implement basic stroking next; I plan to do it by tessellating into quadrilaterals.
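For readers unfamiliar with the trick, here's a hedged sketch of what a gap buffer in coarse raster could look like; the names, slab size, and which end holds which class of command are illustrative assumptions, not the actual implementation:

```glsl
// Hypothetical per-tile command slab filled from both ends: opaque
// commands grow from the front, transparent commands from the back,
// leaving the unused "gap" in the middle. Fine raster then consumes
// cmds[0..head) followed by cmds[tail..TILE_CMDS).
const uint TILE_CMDS = 64u;   // assumed slab size
uint cmds[TILE_CMDS];
uint head = 0u;               // next free slot at the front (opaque)
uint tail = TILE_CMDS;        // one past the last used back slot (transparent)

void push_opaque(uint cmd) {
    cmds[head++] = cmd;
}

void push_transparent(uint cmd) {
    cmds[--tail] = cmd;
}
```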
Pushed two changes (along with a cherry-pick of #73). The first is the ILP fix mentioned in my Zulip check-in; it turns out the compiler is pretty wise, so I managed to reduce the change to a few lines. The second changes the sampling pattern to align with what is widely used for MSAA. (Patterns beyond 16x are not defined in the de-facto DX11 spec though, so I generated a reasonable one with my own script.) On average it slightly improves antialiasing quality, but it's weak at diagonal patterns.
Did a rebase and implemented the most basic part of stroking. On the method of stroking, I had a few choices:
I look forward to implementing joins next.
Turns out that profiling on RADV works, and I went about identifying the bottlenecks. If you have an AMD GPU, all you need to set is:

Well... there wasn't any particular bottleneck, and the compiler has been optimizing fairly well; at this point it's simply compute bound. There was a low-hanging fruit around the use of lookup tables for vertical rays, which I implemented in the latest commit. It improved performance by around 10% to 20%. There are probably many other peephole-like optimizations I haven't explored yet, and since the winding-test code is very hot, they could easily improve performance by a few more percent. For now, though, I have found that this bitmask approach is 2--3x slower than exact area computation.
Still working on rebasing this. I'm also thinking about an alternative approach: implementing front-to-back drawing without modifications to render_ctx, to allow easier switching and experimentation with other approaches.
Many thanks to Venemo on #dri-devel for helping out with instruction-level parallelism. The tiles can now be divided in both the x and y directions, which allows for:
- Less divergence on typical inputs
- Faster texture stores through coalescing
I squashed everything into one commit and rebased. While testing, I noticed that tile-based occlusion culling doesn't really produce any speedup; it's strange but probably reasonable, since the occluded objects tend to be simple (e.g. background shapes that are just solid fills).
I have begun some new experiments with this idea. Multisampling isn't implemented right now, but the goal is to bring decent multisampling within around 20--30% overhead, and I have an idea for changing the multisampling factor dynamically to bound computation time. Stay tuned. Meanwhile, a few excuses on why the PR is half-stalled:
Force-pushed from 1bf304d to e8cb560
Performing the normalization in path_coarse.comp reduces redundant computation, as each thread in a SIMD group computes for a different path. This gives a slight overall performance boost. The normal vector direction was flipped to simplify logic.
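As a hedged sketch of the idea (field and function names are illustrative, not the actual TileSeg layout), the per-segment normalization moves out of the per-pixel loop in kernel4 and into segment emission in path_coarse.comp:

```glsl
// Illustrative TileSeg with a precomputed unit normal; the real struct
// layout in piet-gpu differs.
struct TileSegSketch {
    vec2 origin;
    vec2 vector;
    vec2 normal;   // unit normal, computed once here instead of per pixel
};

void emit_tile_seg(vec2 p0, vec2 p1) {
    vec2 d = p1 - p0;
    // Direction flipped (relative to the earlier code) to simplify the
    // sign handling in fine raster.
    vec2 n = normalize(vec2(d.y, -d.x));
    TileSegSketch seg;
    seg.origin = p0;
    seg.vector = d;
    seg.normal = n;
    // ... write seg to the tile's segment list ...
}
```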
Inspired by the original kernel4 code, this change avoids the expensive lookup whenever it's determined that the lookup result would be discarded (a bitwise AND against zero). It's unfortunate that we end up doing a lot of redundant work in kernel4, but at least we gain up to a 10% improvement for now.
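A minimal sketch of the early-out, with assumed names and with all 32 mask bits packed into a single channel for brevity (the PR's code reads a uvec4):

```glsl
// If every bit the LUT result would be ANDed against is already zero,
// the fetch cannot change the outcome, so skip it entirely.
uint masked_lookup(uint live_bits, float c, vec2 n) {
    if (live_bits == 0u) {
        return 0u;  // AND against zero is zero: no lookup needed
    }
    vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
    return live_bits & texture(winding_lut, tex_coord).x;
}
```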
Closed as this is not mergeable, and with thanks. I hope to use the ideas and implementation details of this PR as part of the implementation work for #270.
I'm researching the best way to generate subsample masks as part of the #270 work, and I found it useful to visualize the contents of the mask LUT. Here's an image:

And here's the JS code to generate it:

<html>
<style>
rect {
stroke: #040;
fill: #cec;
stroke-width: 0.5;
}
circle.off {
fill: #fff;
}
circle.on {
fill: #000;
}
</style>
<svg id="s" width="700" height="500">
</svg>
<script>
const svgNS = "http://www.w3.org/2000/svg";
const size = 20;
const gap = 5;
const n_grid = 16;
// 32 sample positions: a Sobol-style low-discrepancy sequence in the unit square.
const sobel = [
[0.015625, 0.015625],
[0.515625, 0.515625],
[0.765625, 0.265625],
[0.265625, 0.765625],
[0.390625, 0.390625],
[0.890625, 0.890625],
[0.640625, 0.140625],
[0.140625, 0.640625],
[0.203125, 0.328125],
[0.703125, 0.828125],
[0.953125, 0.078125],
[0.453125, 0.578125],
[0.328125, 0.203125],
[0.828125, 0.703125],
[0.578125, 0.453125],
[0.078125, 0.953125],
[0.109375, 0.484375],
[0.609375, 0.984375],
[0.859375, 0.234375],
[0.359375, 0.734375],
[0.484375, 0.109375],
[0.984375, 0.609375],
[0.734375, 0.359375],
[0.234375, 0.859375],
[0.171875, 0.171875],
[0.671875, 0.671875],
[0.921875, 0.421875],
[0.421875, 0.921875],
[0.296875, 0.296875],
[0.796875, 0.796875],
[0.546875, 0.046875],
[0.046875, 0.546875]
];
const s = document.getElementById('s');
for (let j = 0; j < n_grid; j++) {
  for (let i = 0; i < n_grid; i++) {
    const x0 = gap + i * (size + gap);
    const y0 = gap + j * (size + gap);
    // Background square for this LUT texel.
    let rect = document.createElementNS(svgNS, 'rect');
    rect.setAttribute('x', x0);
    rect.setAttribute('y', y0);
    rect.setAttribute('width', size);
    rect.setAttribute('height', size);
    s.appendChild(rect);
    // Reconstruct the half-plane (n, c) encoded at this texel: the texel's
    // offset from the LUT center gives the line's normal and distance.
    const x = (i + 0.5) / n_grid;
    const y = (j + 0.5) / n_grid;
    const dvec = [x - 0.5, y - 0.5];
    const len = Math.hypot(dvec[0], dvec[1]);
    const n = [dvec[0] / len, dvec[1] / len];
    // Inverse of the shader mapping tex_coord = (0.5 - 0.5 * c) * n + 0.5.
    const c = 1 - 2 * len;
    for (let xy of sobel) {
      // Half-plane test per sample point: filled (on) or empty (off).
      const z = n[0] * (xy[0] - 0.5) + n[1] * (xy[1] - 0.5) > c;
      let circ = document.createElementNS(svgNS, 'circle');
      circ.setAttribute('cx', x0 + xy[0] * size);
      circ.setAttribute('cy', y0 + xy[1] * size);
      circ.setAttribute('r', 1);
      circ.classList.add(z ? 'on' : 'off');
      s.appendChild(circ);
    }
  }
}
</script>
</html>
Feb 2022 direction update: this PR has struggled with its performance regression, but there's a sparse approach in development that would alleviate much of that impact. This PR remains open for reference, and will probably be redone once the sparse development is complete.
This PR implements a new approach to antialiasing based on Monte Carlo point sampling. With this approach, each path is tested for visibility against 32 sample points per pixel (as opposed to calculating coverage in a continuous spatial domain). The implementation uses a lookup table so that visibility can be determined in a batch, without performing an implicit-function test for each bit (which would cost 32x).
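To make the batching concrete, here's a hedged sketch of the inner loop. The segment arrays and the single-channel bit packing are assumptions for illustration (the PR's LUT is read as a uvec4), but the structure matches the idea: one fetch flips the winding parity of all 32 sample points at once.

```glsl
const uint MAX_SEGS = 16u;  // illustrative bound
vec2  seg_n[MAX_SEGS];      // unit normal per segment (assumed precomputed)
float seg_c[MAX_SEGS];      // line offset per segment, in LUT parameterization

uint compute_winding_mask(uint n_segs) {
    uint winding_mask = 0u;
    for (uint s = 0u; s < n_segs; s++) {
        vec2 tex_coord = (0.5 - 0.5 * seg_c[s]) * seg_n[s] + vec2(0.5, 0.5);
        // One LUT fetch flips the parity of all 32 sample points at once,
        // replacing 32 per-sample implicit tests.
        winding_mask ^= texture(winding_lut, tex_coord).x;
    }
    return winding_mask;
}

// Under an even-odd rule, coverage is the fraction of set parity bits:
// float coverage = float(bitCount(winding_mask)) / 32.0;
```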
The appealing part of this approach is that it can eliminate all conflation artifacts (#49) when combined with front-to-back drawing. When no transparency is involved, we only need to track (1) the sum of the values drawn so far and (2) the area covered so far. Since point sampling determines without ambiguity whether each point is covered, the conflation artifacts are eliminated (see the sketch below).
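Here's a minimal sketch (with hypothetical per-pixel state) of why this is conflation-free: each of the 32 sample points is claimed exactly once, by the nearest object that covers it, so nothing along a shared edge is ever counted twice.

```glsl
uint covered = 0u;            // (2) sample points covered so far
vec3 color_sum = vec3(0.0);   // (1) sum of the values drawn so far

void draw_opaque(uint obj_mask, vec3 obj_color) {
    // Only samples not already claimed by a nearer object contribute.
    uint fresh = obj_mask & ~covered;
    color_sum += obj_color * (float(bitCount(fresh)) / 32.0);
    covered |= fresh;
}
```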
Transparent objects are moved to the end of the command list in coarse raster and are drawn back to front just as before. Stencil bitmasks are stored to properly deal with occlusion. Transparent objects are not conflation-free, unfortunately.
The clipping implementation was also changed to a bitmask-based one, replacing the old approach, which resembled a composited layer. In the future, when implementing effects that actually need compositing, we will need to reimplement it independently of clipping.
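A hedged sketch of what bitmask-based clipping can look like (stack depth and names are illustrative): a clip path rasterizes to a 32-bit sample mask exactly like a fill, and is then simply ANDed into every subsequent draw.

```glsl
const uint MAX_CLIP_DEPTH = 4u;   // assumed nesting bound
uint clip_stack[MAX_CLIP_DEPTH];
uint clip_depth = 0u;
uint clip_mask = 0xffffffffu;     // all 32 samples unclipped initially

void begin_clip(uint path_mask) {
    clip_stack[clip_depth++] = clip_mask;
    clip_mask &= path_mask;       // nested clips intersect
}

void end_clip() {
    clip_mask = clip_stack[--clip_depth];
}

// Each draw then uses (obj_mask & clip_mask) in place of obj_mask.
```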
The fine raster code is quite compute-heavy right now and causes a 2x--3x slowdown (in fine raster) compared to the previous implementation. It might be possible to optimize this further by changing the contents of TileSeg, moving the heavyweight vector normalization to path_coarse, where it can potentially be done more cheaply.
TODO:
Implementing joins is currently hard, as TileSegs are stored out of order and don't carry any context with them. It might take some refactoring to implement this.
Original description
As an early prototype, I've implemented just the bitmask-based area calculation, to get an idea of how this works and how well it performs.
Profiling the code showed a ~2x slowdown (~110us vs ~220us). For some insane reason, Radeon GPU Profiler now refuses to show me an instruction-level profile, so I have no detailed insights on this. Perhaps the size of the lookup table can be tuned, as well as the CHUNK size.
As a side comment, I'm having a rather stressful time dealing with piet-gpu-hal, since it has two layers of abstraction, which makes modifying the API a little annoying. I wonder if replacing it with something like vulkano is a good idea (although it doesn't support variable-sized descriptors, which are used for image sampling in k4).