Bitmask lookup table based supersampling #64

Closed
wants to merge 7 commits into from

ishitatsuyuki
Collaborator

@ishitatsuyuki ishitatsuyuki commented Feb 16, 2021

Feb 2022 direction update: this PR used to struggle with its performance regression, but there's a sparse approach in development that would alleviate much of that impact. This PR remains open for reference, and will probably be redone once the sparse development is complete.

This PR implements a new approach to antialiasing based on Monte Carlo point sampling. With this approach, each path is tested for visibility against 32 sample points per pixel (as opposed to computing coverage in a continuous spatial domain). The implementation uses a lookup table so that visibility can be determined in a batch, without performing an implicit test for each bit (which would cost 32x as much).
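
A minimal sketch of the lookup-table idea, not the PR's actual shader code. It reuses the parameterization that appears in the shader excerpt and the visualization script elsewhere in this thread (texture coordinate `(0.5 - 0.5 * c) * n + 0.5`, sample test `n . (p - 0.5) > c`). The 8x4 sample grid, the 16x16 table size, and the names `SAMPLES`, `halfPlaneMask`, `buildLut` and `lookupMask` are assumptions made for illustration.

```javascript
// Hypothetical 8x4 sample grid standing in for the PR's real 32-sample pattern.
const SAMPLES = [];
for (let k = 0; k < 32; k++) {
    SAMPLES.push([((k % 8) + 0.5) / 8, (Math.floor(k / 8) + 0.5) / 4]);
}

// Exact per-sample test: bit k is set when sample k satisfies n . (p - 0.5) > c.
// This is the 32x cost that the table is meant to avoid at render time.
function halfPlaneMask(n, c) {
    let mask = 0;
    SAMPLES.forEach(([px, py], k) => {
        if (n[0] * (px - 0.5) + n[1] * (py - 0.5) > c) mask |= 1 << k;
    });
    return mask >>> 0; // keep the result as an unsigned 32-bit value
}

const N_GRID = 16; // table resolution (assumed)

// Precompute one mask per texel. The texel center, taken relative to (0.5, 0.5),
// encodes the normal direction (its angle) and the offset c (its length).
function buildLut() {
    const lut = new Uint32Array(N_GRID * N_GRID);
    for (let j = 0; j < N_GRID; j++) {
        for (let i = 0; i < N_GRID; i++) {
            const dx = (i + 0.5) / N_GRID - 0.5;
            const dy = (j + 0.5) / N_GRID - 0.5;
            const len = Math.hypot(dx, dy); // never 0 for an even N_GRID
            const n = [dx / len, dy / len];
            const c = 1 - 2 * len;
            lut[j * N_GRID + i] = halfPlaneMask(n, c);
        }
    }
    return lut;
}

// One table fetch replaces 32 implicit tests. Coordinates are clamped, so this
// sketch does not handle (n, c) pairs that fall outside the table's range.
function lookupMask(lut, n, c) {
    const u = (0.5 - 0.5 * c) * n[0] + 0.5;
    const v = (0.5 - 0.5 * c) * n[1] + 0.5;
    const i = Math.min(N_GRID - 1, Math.max(0, Math.floor(u * N_GRID)));
    const j = Math.min(N_GRID - 1, Math.max(0, Math.floor(v * N_GRID)));
    return lut[j * N_GRID + i];
}
```

The table is only as accurate as its resolution: a lookup snaps `(n, c)` to the nearest texel, so `N_GRID` trades memory and cache pressure against the precision of edge positions.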

The appealing part of this approach is that it can eliminate all conflation artifacts (#49) when combined with front-to-back drawing. When no transparency is involved, we only need to track 1. the sum of the values drawn so far, and 2. the area that is covered so far. Since point sampling determines unambiguously whether each sample point is covered or not, the conflation artifacts are eliminated.
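
The two tracked quantities can be sketched as follows. This is an illustrative model of a single pixel with opaque paths only, not the PR's fine raster code; `popcount32`, `compositeOpaqueFrontToBack` and the `{ mask, rgb }` path records are hypothetical names.

```javascript
// Count the set bits of a 32-bit mask.
function popcount32(x) {
    x = x >>> 0;
    let count = 0;
    while (x !== 0) {
        x &= x - 1; // clear the lowest set bit
        count++;
    }
    return count;
}

// paths: opaque paths ordered front to back, each with a 32-bit sample mask
// and a color. Returns the accumulated color and the covered-sample mask.
function compositeOpaqueFrontToBack(paths) {
    let covered = 0;           // 2. the area covered so far, as a bitmask
    const color = [0, 0, 0];   // 1. the sum of the values drawn so far
    for (const { mask, rgb } of paths) {
        // Only samples not already claimed by a path in front contribute.
        const visible = (mask & ~covered) >>> 0;
        const weight = popcount32(visible) / 32;
        for (let ch = 0; ch < 3; ch++) color[ch] += weight * rgb[ch];
        covered = (covered | visible) >>> 0;
        if (covered === 0xFFFFFFFF) break; // pixel is full, the rest is occluded
    }
    return { color, covered };
}
```

For example, two red shapes that abut inside the pixel (masks `0x0000FFFF` and `0xFFFF0000`) drawn over a full-pixel green background yield pure red: every sample is claimed by exactly one of the red shapes, so no green leaks through the seam.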

Transparent objects are moved to the end of the command list in coarse raster and they draw back to front just as before. The stencil bitmasks are stored to properly deal with occlusion. Transparent objects are not conflation free, unfortunately.

The clipping implementation was also changed to a bitmask-based one, replacing the old approach which resembled a composited layer. In the future, when implementing effects that actually need composition, we need to reimplement it independent of clipping.

The fine raster code is quite compute heavy right now and it causes a 2x--3x slowdown (in fine raster) compared to the previous implementation. It might be possible to further optimize this by changing the contents of TileSeg, moving the heavyweight vector normalization to path_coarse where it can be done potentially more cheaply.

TODO:

  • Strokes
    Implementing joins is currently hard, as TileSegs are stored out of order and don't carry any context with them. It might take some refactoring to implement this.
Original description

As an early prototype I've implemented just area calculation based on bitmasks to have an idea of how this works and how well it performs.

Profiling the code showed a ~2x slowdown (~110us vs ~220us). For some insane reason Radeon GPU Profiler now refuses to show me an instruction-level profile, so I have no detailed insights on this. Perhaps the size of the lookup table can be tuned, as well as the CHUNK size.

As a side comment, I'm having a rather stressful time dealing with piet-gpu-hal, since it has two layers of abstraction, which makes modifying the API a little bit annoying. I wonder if replacing it with something like vulkano is a good idea (although it doesn't support variable-sized descriptors, which are used for image sampling in k4).

@raphlinus
Contributor

On the specific question of piet-gpu-hal, yes, it adds friction. I see three ways it can go:

  • Get rid of it and write ash.
  • Replace it with wgpu.
  • Actually implement the dx12 and metal backends. I have some progress on dx12 locally.

The sampling work is certainly interesting, but I haven't dug into it in detail (and I'm struggling with time management right now). It's slightly discouraging that there's a slowdown, but it might still be acceptable if it solves the conflation artifacts. In any case, I'm very happy you're exploring this, as I think it's one of the more interesting questions.

// mask ^= 0;
} else {
vec2 tex_coord = (0.5 - 0.5 * c) * n + vec2(0.5, 0.5);
uvec4 tex = texture(winding_lut, tex_coord);
Collaborator

Can this be replaced with imageLoads?

Texture sampling from compute programs is relatively slow on lower end hardware, probably because of the unpredictable sampling coordinates. Also, compiling kernel4 for CPUs through something like SwiftShader is much harder with texture sampling. See my #63 that replaces texture(Grad) with imageLoad.

Collaborator Author

Yeah, it surely makes sense to replace this with an image. Though due to the piet-gpu-hal mess, this would mean partially applying #63, which I don't want to do right now. I'll probably make the switch when it's easy enough to do.

@ishitatsuyuki
Collaborator Author

Here's a rough version of front-to-back drawing. It doesn't handle opacity yet, but the plan is to reorder the commands in coarse raster so that transparent objects are drawn back to front after all opaque objects are drawn. In fine raster I plan to have a save/restore operation for stencil masks so that we can still be somewhat accurate when drawing those alpha objects.

I also managed to get a profile. The workload has become mostly bound on the latency of LUT sampling, and the ALU utilization is down to 55% (from 80%). Not particularly good or bad, but it's something to keep in mind.

@ishitatsuyuki
Collaborator Author

Here's the version with back-to-front drawing for transparent objects. It works by using something like a gap buffer in coarse raster. The way it's currently implemented is rather ugly, but I think it is the least complicated data structure we can use to achieve the purpose.
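
A rough sketch of the ordering this produces, assuming the input command list is in painter's (back-to-front) order. The PR's actual coarse raster data structure is not shown in this thread, so `orderCommands`, the `opaque` flag and the two-cursor fill are assumptions for illustration; stencil save/restore is omitted.

```javascript
// Fill one output buffer from both ends, leaving the gap in the middle:
//   [ opaque, front to back ... | gap | ... transparent, back to front ]
function orderCommands(cmds) {
    const out = new Array(cmds.length);
    let head = 0;            // next free slot for opaque commands
    let tail = cmds.length;  // one past the next free slot for transparent ones
    // Walk the painter's-order list from the front-most object backwards.
    for (let i = cmds.length - 1; i >= 0; i--) {
        if (cmds[i].opaque) {
            out[head++] = cmds[i]; // front-most opaque object lands first
        } else {
            out[--tail] = cmds[i]; // back-most transparent object lands first
        }
    }
    return out;
}
```

Here the buffer is sized exactly, so the gap closes when the loop ends. With a fixed-capacity per-tile buffer, the unused slots between `head` and `tail` would remain as the gap.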

I think I'm going to implement basic stroking next. I plan to do it by tessellating into quadrilaterals.

@ishitatsuyuki
Collaborator Author

Pushed two changes (along with a cherry-pick of #73).

The first one is the fix for ILP as mentioned in my Zulip checkin. Turns out that the compiler is pretty wise so I managed to reduce the change to a few lines.

The second one changes the sampling pattern to align with what is widely used for MSAA. (Patterns beyond 16x are not defined in the de-facto DX11 spec though, so I generated a reasonable one with my own script.) On average it slightly improves antialiasing quality, but it's weak at diagonal patterns.

@ishitatsuyuki
Collaborator Author

Did a rebase and implemented the most basic part of stroking. On the method of stroking, I had a few choices:

  1. Perform a full-blown stroke-to-path, handling the two parallel curves independently, and also calculate and directly approximate the evolute when necessary (see Neh20 fig. 11). This yields the fewest segments, but it involves a lot of research problems around curve flattening and is painful to implement (or even just to figure out).
  2. Convert the strokes to fills through tessellation. This approach is more similar to Polar Stroking. It handles evolutes in a more brute-force way, although I think the accuracy will be fine. The visibility rule is an issue though; you need to decompose self-intersecting quads so that they work with non-zero filling. It also amplifies the segment count by 4x, which is... duh.
  3. (Currently implemented) Keep the current path_coarse logic and handle stroking using implicit tests. I opted for this because it can be optimized slightly further using custom filling rules. This approach also allows straightforward access to arc length, which is needed for decoration. Currently I think it works pretty well, but it might have disadvantages compared to stroke-to-path when handling very bold/zoomed strokes. Also, the approximation error does not take the stroke width into account; we can revisit that in the future.
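
One way to picture option 3 is a per-sample distance test against each flattened line segment. The sketch below is an assumption about what such an implicit test could look like, not the PR's shader: `SAMPLES` is a hypothetical 8x4 grid, and `distToSegment` and `strokeMask` are illustrative names. A plain distance test like this produces round caps and ignores joins.

```javascript
// Hypothetical 8x4 sample grid standing in for the PR's real 32-sample pattern.
const SAMPLES = [];
for (let k = 0; k < 32; k++) {
    SAMPLES.push([((k % 8) + 0.5) / 8, (Math.floor(k / 8) + 0.5) / 4]);
}

// Euclidean distance from point p to the segment a-b (all in pixel units).
function distToSegment(p, a, b) {
    const abx = b[0] - a[0], aby = b[1] - a[1];
    const apx = p[0] - a[0], apy = p[1] - a[1];
    const lenSq = abx * abx + aby * aby;
    // Parameter of the closest point on the segment, clamped to its ends.
    const t = lenSq === 0 ? 0 : Math.min(1, Math.max(0, (apx * abx + apy * aby) / lenSq));
    return Math.hypot(apx - t * abx, apy - t * aby);
}

// Bit k is set when sample k lies within halfWidth of the segment.
function strokeMask(a, b, halfWidth) {
    let mask = 0;
    SAMPLES.forEach((p, k) => {
        if (distToSegment(p, a, b) <= halfWidth) mask |= 1 << k;
    });
    return mask >>> 0;
}
```

Masks from consecutive segments of one stroke would be combined with a bitwise OR, which is one reason a stroke does not need the winding bookkeeping that fills do.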

I look forward to implementing joins next.

@ishitatsuyuki
Collaborator Author

Turns out that profiling on RADV works, so I went looking for the bottlenecks. If you have an AMD GPU, all you need to set is: RADV_THREAD_TRACE_PIPELINE=1 RADV_THREAD_TRACE_BUFFER_SIZE=40000000 RADV_THREAD_TRACE=500 (replace 500 with the frame number to capture; I use 500 to allow for warmup).

Well... There wasn't any particular bottleneck, and the compiler has also been optimizing fairly well. At this point it was simply compute bound. There was one piece of low-hanging fruit around the use of lookup tables for vertical rays, which I have implemented in the latest commit. It improved the performance by probably around 10% to 20%.

There are probably a lot of other peephole-like optimizations I haven't explored yet, and since the winding test code is very hot, they can easily improve the performance by a few percent. Though, for now, I have found that this bitmask approach is 2--3x slower than exact area computation.

@ishitatsuyuki
Collaborator Author

Still working on rebasing this. I'm also thinking about an alternative approach of implementing front-to-back drawing without modifications to render_ctx, to allow easier switching/experiments with other approaches.

Many thanks to Venemo on #dri-devel for helping out with instruction-level parallelism.

The tiles can now be divided in both the x and y directions, which allows for:
- Less divergence on typical inputs
- Faster texture stores through coalescing
@ishitatsuyuki
Collaborator Author

I squashed everything into one commit and rebased.

While testing, I've noticed that tile-based occlusion culling does not really give any speedup; it's strange but probably reasonable, since the occluded objects tend to be simple (e.g. background shapes, which are just solid fills).

@ishitatsuyuki
Collaborator Author

ishitatsuyuki commented Apr 10, 2021

I have begun some new experiments with this idea in the lut3 branch. It works extremely well on text content (like paper-1 and paper-2 from MPVG), where 99% of the pixels can be handled with the two-area approach. I think this is an appealing characteristic for use cases like text with gamma-blended opacity, commonly seen in some design systems like Material. For general vector graphics content it's neither good nor bad.

Multisampling isn't implemented right now, but the goal is to bring decent multisampling within around 20--30% of overhead, and I have an idea for changing multisampling factor dynamically to bound computation time. Stay tuned.

Meanwhile, a few excuses on why the PR is half-stalled:

  • Front-to-back drawing is the key element of this approach, but I still can't figure out how to incorporate it into the coarse shader without messing up the structure too much.
  • I want to implement stroke joins, but again, I need to plumb the normal vectors from adjacent line fragments, which takes some work.

@ishitatsuyuki ishitatsuyuki force-pushed the lut branch 2 times, most recently from 1bf304d to e8cb560 Compare April 11, 2021 11:54
Performing the normalization in path_coarse.comp reduces redundant
computation as each thread in SIMD computes for a different path.
Gives a slight overall performance boost.

The normal vector direction was flipped to simplify logic.
Inspired by the original kernel4 code, this change avoids the expensive lookup whenever it's determined that the lookup result would be discarded (bit AND against a zero).

It's unfortunate that we end up doing a lot of redundant work in kernel4, but at least we gain up to a 10% improvement for now.
@raphlinus
Contributor

Closed as this is not mergeable, and with thanks. I hope to use ideas and implementation details of this PR as part of the implementation work of #270.

@raphlinus raphlinus closed this Feb 3, 2023
@raphlinus
Contributor

I'm researching the best way to generate subsample masks as part of the #270 work, and found it useful to visualize the contents of the mask LUT. Here's an image:

[Image: visualization of the mask LUT contents (Screenshot 2023-02-18 at 8 51 55 AM)]

And here's the JS code to generate it:

<html>
    <style>
        rect {
            stroke: #040;
            fill: #cec;
            stroke-width: 0.5;
        }
        circle.off {
            fill: #fff;
        }
        circle.on {
            fill: #000;
        }
    </style>
    <svg id="s" width="700" height="500">
    </svg>
<script>
const svgNS = "http://www.w3.org/2000/svg";
const size = 20;
const gap = 5;
const n_grid = 16;
const sobel = [
    [0.015625, 0.015625],
    [0.515625, 0.515625],
    [0.765625, 0.265625],
    [0.265625, 0.765625],
    [0.390625, 0.390625],
    [0.890625, 0.890625],
    [0.640625, 0.140625],
    [0.140625, 0.640625],
    [0.203125, 0.328125],
    [0.703125, 0.828125],
    [0.953125, 0.078125],
    [0.453125, 0.578125],
    [0.328125, 0.203125],
    [0.828125, 0.703125],
    [0.578125, 0.453125],
    [0.078125, 0.953125],
    [0.109375, 0.484375],
    [0.609375, 0.984375],
    [0.859375, 0.234375],
    [0.359375, 0.734375],
    [0.484375, 0.109375],
    [0.984375, 0.609375],
    [0.734375, 0.359375],
    [0.234375, 0.859375],
    [0.171875, 0.171875],
    [0.671875, 0.671875],
    [0.921875, 0.421875],
    [0.421875, 0.921875],
    [0.296875, 0.296875],
    [0.796875, 0.796875],
    [0.546875, 0.046875],
    [0.046875, 0.546875]
];
const s = document.getElementById('s');
for (let j = 0; j < n_grid; j++) {
    for (let i = 0; i < n_grid; i++) {
        const x0 = gap + i * (size + gap);
        const y0 = gap + j * (size + gap);
        let rect = document.createElementNS(svgNS, 'rect');
        rect.setAttribute('x', x0);
        rect.setAttribute('y', y0);
        rect.setAttribute('width', size);
        rect.setAttribute('height', size);
        s.appendChild(rect);

        const x = (i + 0.5) / n_grid;
        const y = (j + 0.5) / n_grid;
        const dvec = [x - 0.5, y - 0.5];
        const len = Math.hypot(dvec[0], dvec[1]);
        const n = [dvec[0] / len, dvec[1] / len];
        const c = 1 - 2 * len;
        for (let xy of sobel) {
            const z = n[0] * (xy[0] - 0.5) + n[1] * (xy[1] - 0.5) > c;
            let circ = document.createElementNS(svgNS, 'circle');
            circ.setAttribute('cx', x0 + xy[0] * size);
            circ.setAttribute('cy', y0 + xy[1] * size);
            circ.setAttribute('r', 1);
            circ.classList.add(z ? 'on' : 'off');
            s.appendChild(circ);
        }
    }
}
</script>
</html>
