Bitmask lookup table based supersampling #64
Conversation
On the specific question of piet-gpu-hal, yes, it adds friction. I see three ways it can go:
The sampling work is certainly interesting, but I haven't dug into it in detail (and I'm struggling with time management right now). It's slightly discouraging that there's a slowdown, but it might still be acceptable if it solves the conflation artifacts. In any case, I'm very happy you're exploring this, as I think it's one of the more interesting questions.
piet-gpu/shader/kernel4.comp (outdated)

// mask ^= 0;
} else {
    vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
    uvec4 tex = texture(winding_lut, tex_coord);
Can this be replaced with imageLoads?
Texture sampling from compute programs is relatively slow on lower-end hardware, probably because of the unpredictable sampling coordinates. Also, compiling kernel4 for CPUs through something like SwiftShader is much harder with texture sampling. See my #63, which replaces texture(Grad) with imageLoad.
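For reference, here's a minimal sketch of what the imageLoad variant could look like. The image binding, format, `winding_lut_img`, and `LUT_SIZE` are illustrative assumptions, not the actual piet-gpu declarations; since integer textures can't be filtered anyway, snapping to the nearest texel loses nothing:

```glsl
// Hypothetical replacement: bind the LUT as a storage image and fetch it
// with imageLoad at explicitly computed integer coordinates.
layout(rgba32ui, binding = 4) uniform readonly uimage2D winding_lut_img;

const int LUT_SIZE = 16;  // assumed LUT dimension

uvec4 load_winding_mask(float c, vec2 n) {
    // Same coordinate mapping as the texture() path, snapped to a texel.
    vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
    ivec2 icoord = clamp(ivec2(tex_coord * float(LUT_SIZE)),
                         ivec2(0), ivec2(LUT_SIZE - 1));
    return imageLoad(winding_lut_img, icoord);
}
```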
Yeah, it surely makes sense to replace this with an image. Though, due to the piet-gpu-hal mess, this would mean partially applying #63, which I don't want to do right now. I'll probably make the switch when it's easy enough to do.
Here's a rough version of front-to-back drawing. It doesn't handle opacity yet, but the plan is to reorder objects in coarse raster so that transparent objects are drawn back to front after all opaque objects are drawn. In fine raster I plan to have a save/restore operation for stencil masks so that we can still be somewhat accurate when drawing those alpha objects. I also managed to get a profile. The workload has become mostly bound on the latency of LUT sampling, and ALU utilization is down to 55% (from 80%). Not particularly good or bad, but it's something to keep in mind.
Here's the version with back-to-front drawing for transparent objects. It works by using something like a gap buffer in coarse raster (see the sketch below). The way it's currently implemented is rather ugly, but I think it's the least complicated data structure that achieves the purpose. I think I'm going to implement basic stroking next; I plan to do it by tessellating into quadrilaterals.
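For readers unfamiliar with the trick, here's a hedged sketch of what a gap buffer in coarse raster could look like; the names, slab size, and which end holds which class of command are illustrative assumptions, not the actual implementation:

```glsl
// Hypothetical per-tile command slab filled from both ends: opaque
// commands grow from the front, transparent commands from the back,
// leaving the unused "gap" in the middle. Fine raster then consumes
// cmds[0..head) followed by cmds[tail..TILE_CMDS).
const uint TILE_CMDS = 64u;   // assumed slab size
uint cmds[TILE_CMDS];
uint head = 0u;               // next free slot at the front (opaque)
uint tail = TILE_CMDS;        // one past the last used back slot (transparent)

void push_opaque(uint cmd) {
    cmds[head++] = cmd;
}

void push_transparent(uint cmd) {
    cmds[--tail] = cmd;
}
```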
Pushed two changes (along with a cherry-pick of #73). The first is the ILP fix mentioned in my Zulip check-in; it turns out the compiler is pretty wise, so I managed to reduce the change to a few lines. The second changes the sampling pattern to align with what is widely used for MSAA. (Patterns beyond 16x are not defined in the de-facto DX11 spec though, so I generated a reasonable one with my own script.) On average it slightly improves antialiasing quality, but it's weak at diagonal patterns.
Did a rebase and implemented the most basic part of stroking. On the method of stroking, I had a few choices:
I look forward to implementing joins next.
Turns out that profiling on RADV works, and I went about identifying the bottlenecks. If you have an AMD GPU, all you need to set is:

Well... there wasn't any particular bottleneck, and the compiler has been optimizing fairly well; at this point it's simply compute bound. There was a low-hanging fruit around the use of lookup tables for vertical rays, which I implemented in the latest commit. It improved performance by around 10% to 20%. There are probably many other peephole-like optimizations I haven't explored yet, and since the winding-test code is very hot, they could easily improve performance by a few more percent. For now, though, I have found that this bitmask approach is 2--3x slower than exact area computation.
Still working on rebasing this. I'm also thinking about an alternative approach: implementing front-to-back drawing without modifications to render_ctx, to allow easier switching and experimentation with other approaches.
Many thanks to Venemo on #dri-devel for helping out with instruction-level parallelism. The tiles can now be divided in both the x and y directions, which allows for:
- Less divergence on typical inputs
- Faster texture stores through coalescing
I squashed everything into one commit and rebased. While testing, I noticed that tile-based occlusion culling doesn't really produce any speedup; it's strange but probably reasonable, since the occluded objects tend to be simple (e.g. background shapes that are just solid fills).
I have begun some new experiments with this idea. Multisampling isn't implemented right now, but the goal is to bring decent multisampling within around 20--30% overhead, and I have an idea for changing the multisampling factor dynamically to bound computation time. Stay tuned. Meanwhile, a few excuses on why the PR is half-stalled:
Force-pushed from 1bf304d to e8cb560
Performing the normalization in path_coarse.comp reduces redundant computation, as each thread in a SIMD group computes for a different path. This gives a slight overall performance boost. The normal vector direction was flipped to simplify logic.
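As a hedged sketch of the idea (field and function names are illustrative, not the actual TileSeg layout), the per-segment normalization moves out of the per-pixel loop in kernel4 and into segment emission in path_coarse.comp:

```glsl
// Illustrative TileSeg with a precomputed unit normal; the real struct
// layout in piet-gpu differs.
struct TileSegSketch {
    vec2 origin;
    vec2 vector;
    vec2 normal;   // unit normal, computed once here instead of per pixel
};

void emit_tile_seg(vec2 p0, vec2 p1) {
    vec2 d = p1 - p0;
    // Direction flipped (relative to the earlier code) to simplify the
    // sign handling in fine raster.
    vec2 n = normalize(vec2(d.y, -d.x));
    TileSegSketch seg;
    seg.origin = p0;
    seg.vector = d;
    seg.normal = n;
    // ... write seg to the tile's segment list ...
}
```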
Inspired by the original kernel4 code, this change avoids the expensive lookup whenever it's determined that the lookup result would be discarded (a bitwise AND against zero). It's unfortunate that we end up doing a lot of redundant work in kernel4, but at least we gain up to a 10% improvement for now.
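A minimal sketch of the early-out, with assumed names and with all 32 mask bits packed into a single channel for brevity (the PR's code reads a uvec4):

```glsl
// If every bit the LUT result would be ANDed against is already zero,
// the fetch cannot change the outcome, so skip it entirely.
uint masked_lookup(uint live_bits, float c, vec2 n) {
    if (live_bits == 0u) {
        return 0u;  // AND against zero is zero: no lookup needed
    }
    vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
    return live_bits & texture(winding_lut, tex_coord).x;
}
```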
Closed as this is not mergeable, and with thanks. I hope to use the ideas and implementation details of this PR as part of the implementation work for #270.
I'm researching the best way to generate subsample masks as part of the #270 work, and I found it useful to visualize the contents of the mask LUT. Here's an image:

And here's the JS code to generate it:

<html>
<style>
rect {
stroke: #040;
fill: #cec;
stroke-width: 0.5;
}
circle.off {
fill: #fff;
}
circle.on {
fill: #000;
}
</style>
<svg id="s" width="700" height="500">
</svg>
<script>
const svgNS = "http://www.w3.org/2000/svg";
const size = 20;
const gap = 5;
const n_grid = 16;
// 32 sample positions: a Sobol-style low-discrepancy sequence in the unit square.
const sobel = [
[0.015625, 0.015625],
[0.515625, 0.515625],
[0.765625, 0.265625],
[0.265625, 0.765625],
[0.390625, 0.390625],
[0.890625, 0.890625],
[0.640625, 0.140625],
[0.140625, 0.640625],
[0.203125, 0.328125],
[0.703125, 0.828125],
[0.953125, 0.078125],
[0.453125, 0.578125],
[0.328125, 0.203125],
[0.828125, 0.703125],
[0.578125, 0.453125],
[0.078125, 0.953125],
[0.109375, 0.484375],
[0.609375, 0.984375],
[0.859375, 0.234375],
[0.359375, 0.734375],
[0.484375, 0.109375],
[0.984375, 0.609375],
[0.734375, 0.359375],
[0.234375, 0.859375],
[0.171875, 0.171875],
[0.671875, 0.671875],
[0.921875, 0.421875],
[0.421875, 0.921875],
[0.296875, 0.296875],
[0.796875, 0.796875],
[0.546875, 0.046875],
[0.046875, 0.546875]
];
const s = document.getElementById('s');
for (let j = 0; j < n_grid; j++) {
  for (let i = 0; i < n_grid; i++) {
    const x0 = gap + i * (size + gap);
    const y0 = gap + j * (size + gap);
    // Background square for this LUT texel.
    let rect = document.createElementNS(svgNS, 'rect');
    rect.setAttribute('x', x0);
    rect.setAttribute('y', y0);
    rect.setAttribute('width', size);
    rect.setAttribute('height', size);
    s.appendChild(rect);
    // Reconstruct the half-plane (n, c) encoded at this texel: the texel's
    // offset from the LUT center gives the line's normal and distance.
    const x = (i + 0.5) / n_grid;
    const y = (j + 0.5) / n_grid;
    const dvec = [x - 0.5, y - 0.5];
    const len = Math.hypot(dvec[0], dvec[1]);
    const n = [dvec[0] / len, dvec[1] / len];
    // Inverse of the shader mapping tex_coord = (0.5 - 0.5 * c) * n + 0.5.
    const c = 1 - 2 * len;
    for (let xy of sobel) {
      // Half-plane test per sample point: filled (on) or empty (off).
      const z = n[0] * (xy[0] - 0.5) + n[1] * (xy[1] - 0.5) > c;
      let circ = document.createElementNS(svgNS, 'circle');
      circ.setAttribute('cx', x0 + xy[0] * size);
      circ.setAttribute('cy', y0 + xy[1] * size);
      circ.setAttribute('r', 1);
      circ.classList.add(z ? 'on' : 'off');
      s.appendChild(circ);
    }
  }
}
</script>
</html>
Feb 2022 direction update: this PR has struggled with its performance regression, but there's a sparse approach in development that would alleviate much of that impact. This PR remains open for reference, and will probably be redone once the sparse development is complete.
This PR implements a new approach to antialiasing based on Monte Carlo point sampling. With this approach, each path is tested for visibility against 32 sample points per pixel (as opposed to calculating coverage in a continuous spatial domain). The implementation uses a lookup table so that visibility can be determined in a batch, without performing an implicit-function test for each bit (which would cost 32x).
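To make the batching concrete, here's a hedged sketch of the inner loop. The segment arrays and the single-channel bit packing are assumptions for illustration (the PR's LUT is read as a uvec4), but the structure matches the idea: one fetch flips the winding parity of all 32 sample points at once.

```glsl
const uint MAX_SEGS = 16u;  // illustrative bound
vec2  seg_n[MAX_SEGS];      // unit normal per segment (assumed precomputed)
float seg_c[MAX_SEGS];      // line offset per segment, in LUT parameterization

uint compute_winding_mask(uint n_segs) {
    uint winding_mask = 0u;
    for (uint s = 0u; s < n_segs; s++) {
        vec2 tex_coord = (0.5 - 0.5 * seg_c[s]) * seg_n[s] + vec2(0.5, 0.5);
        // One LUT fetch flips the parity of all 32 sample points at once,
        // replacing 32 per-sample implicit tests.
        winding_mask ^= texture(winding_lut, tex_coord).x;
    }
    return winding_mask;
}

// Under an even-odd rule, coverage is the fraction of set parity bits:
// float coverage = float(bitCount(winding_mask)) / 32.0;
```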
The appealing part of this approach is that it can eliminate all conflation artifacts (#49) when combined with front-to-back drawing. When no transparency is involved, we only need to track (1) the sum of the values drawn so far and (2) the area covered so far. Since point sampling determines without ambiguity whether each point is covered, the conflation artifacts are eliminated (see the sketch below).
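Here's a minimal sketch (with hypothetical per-pixel state) of why this is conflation-free: each of the 32 sample points is claimed exactly once, by the nearest object that covers it, so nothing along a shared edge is ever counted twice.

```glsl
uint covered = 0u;            // (2) sample points covered so far
vec3 color_sum = vec3(0.0);   // (1) sum of the values drawn so far

void draw_opaque(uint obj_mask, vec3 obj_color) {
    // Only samples not already claimed by a nearer object contribute.
    uint fresh = obj_mask & ~covered;
    color_sum += obj_color * (float(bitCount(fresh)) / 32.0);
    covered |= fresh;
}
```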
Transparent objects are moved to the end of the command list in coarse raster and are drawn back to front just as before. Stencil bitmasks are stored to properly deal with occlusion. Transparent objects are not conflation-free, unfortunately.
The clipping implementation was also changed to a bitmask-based one, replacing the old approach, which resembled a composited layer. In the future, when implementing effects that actually need compositing, we will need to reimplement it independently of clipping.
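A hedged sketch of what bitmask-based clipping can look like (stack depth and names are illustrative): a clip path rasterizes to a 32-bit sample mask exactly like a fill, and is then simply ANDed into every subsequent draw.

```glsl
const uint MAX_CLIP_DEPTH = 4u;   // assumed nesting bound
uint clip_stack[MAX_CLIP_DEPTH];
uint clip_depth = 0u;
uint clip_mask = 0xffffffffu;     // all 32 samples unclipped initially

void begin_clip(uint path_mask) {
    clip_stack[clip_depth++] = clip_mask;
    clip_mask &= path_mask;       // nested clips intersect
}

void end_clip() {
    clip_mask = clip_stack[--clip_depth];
}

// Each draw then uses (obj_mask & clip_mask) in place of obj_mask.
```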
The fine raster code is quite compute-heavy right now and causes a 2x--3x slowdown (in fine raster) compared to the previous implementation. It might be possible to optimize this further by changing the contents of TileSeg, moving the heavyweight vector normalization to path_coarse, where it can potentially be done more cheaply.
TODO:
Implementing joins is currently hard, as TileSegs are stored out of order and don't carry any context with them. It might take some refactoring to implement this.
Original description
As an early prototype, I've implemented just the bitmask-based area calculation, to get an idea of how this works and how well it performs.
Profiling the code showed a ~2x slowdown (~110us vs ~220us). For some insane reason, Radeon GPU Profiler now refuses to show me an instruction-level profile, so I have no detailed insights on this. Perhaps the size of the lookup table can be tuned, as well as the CHUNK size.
As a side comment, I'm having a rather stressful time dealing with piet-gpu-hal, since it has two layers of abstraction, which makes modifying the API a little annoying. I wonder if replacing it with something like vulkano is a good idea (although it doesn't support variable-sized descriptors, which are used for image sampling in k4).