Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper nested clipping #36

Closed
raphlinus opened this issue Nov 14, 2020 · 5 comments
Closed

Proper nested clipping #36

raphlinus opened this issue Nov 14, 2020 · 5 comments

Comments

@raphlinus
Copy link
Contributor

This issue contains a detailed design of how to properly add clipping to piet-gpu. It is a followup to #30, which represents a more limited form of clipping which might be useful in some UI contexts, but doesn't efficiently capture the general clipping problem. The PostScript-derived imaging model entails clips being nested to arbitrary depth, with the nesting defined by save and restore operations. The arbitrary depth is challenging for a GPU, which is generally optimized for fixed-sized grains of work. Thus, this design contains profound changes across the entire pipeline.

A central concern is bounding box tracking. The invariant associated with a bounding box of a drawing operation is that pixels outside the bounding box are not affected. In the case of a clip operation, it is tempting to take the bounding box of the clip mask, but this is inadequate. Pixels outside the bounding box of the clip mask are affected, in particular all drawing operations inside the clip become no-ops.

The invariant is restored when the bounding box of a clip operation is taken as the union of the bounding boxes of the drawing operations inside.

Encoding

A long-term goal of piet-gpu is to make the CPU-side encoding process as lightweight as possible. To this end, there is a nontrivial bbox tracking mechanism in the encoding stage of the pipeline, eliminating the burden to track bboxes for stroke and fill elements during CPU-side encoding. An additional potential benefit of delegating bbox tracking to the GPU is that display lists can be assembled by binary string concatenation of encoded display list fragments.

However, maintaining those properties in the face of nested clips is nontrivial, therefore we propose to regress the light weight of CPU encoding, with the hope of restoring it in a later cycle. In this design, we will track the bbox of each drawing element. When encoding a clip, we include the clip's bbox (which is the union of drawing elements contained in the clip) in a "push clip" element, and also in the corresponding "pop" element, which is encoded at the time of the restore method is called on the render context.

Note that bbox tracking also entails tracking of transforms, another task that was entirely subsumed by element processing in the previous design.

Element processing

Element processing is barely affected by this design; the main requirement is passing through the encoded bbox.

Element processing also potentially computes the bbox of the clip path. I believe it makes sense to simply discard that, and use the encoded bbox.

Binning

Binning is barely affected; the main requirement is to use the clip's bbox for binning purposes. (as opposed to the clip path's bbox, if both bboxes are in fact preserved by the element processing stage).

Coarse rasterization

An unoptimized version of coarse rasterization would be fairly simple, and likely this should be done as part of a staged implementation strategy. Basically, the "push clip" element is recorded in the ptcl as a "push clip", and likewise for a "pop". Keep in mind that "push clip" has a fair amount in common with a "fill" element, in particular the fact that it references a clip path composed of a number of path segments preceding the "push clip" element.

However, considerable optimization is possible, as in the case of fill, when the set of path segments intersecting the tile is empty. The handling will vary depending on the backdrop. When the backdrop is 0, the clip need not be encoded at all, and the drawing operations until the corresponding "pop" can be suppressed. This can be tracked as a single "culled nesting level" integer, with a default of 0. All drawing operations check it and skip when it is non-zero. A "push clip" with empty path segments and a backdrop of 0 increments it, as does any "push clip" when it is already non-zero. A "pop clip" decrements it when it is non-zero.

A related but different optimization is to avoid writing push and pop operations to the ptcl when the backdrop is non-zero (and there are also no path segments in the tile). Eliding the push is easy, but to elide the corresponding pop requires a stack to track what happened to the push. Since nesting depth is arbitrary, this would potentially require dynamic memory allocations for the stack, but I propose this efficient solution: a fixed size (32 element) stack of bools, represented as a single 4-byte word, and another scalar to track stack depth. When depth exceeds 32, the optimization is disabled (ie there is always a push operation).

A further optimization would erase a "push clip" operation written to the ptcl tape if it is immediately followed by a "pop" with no intervening drawing operations.

Fine rasterization

The fact that clips can be nested arbitrarily creates significant challenges at fine rasterization time, as dynamic memory allocations are needed. Each pixel requires a clip/blend stack, consisting of an RGB(A) pixel and an alpha mask value.

Let's review what "push clip" and "pop" actually do. The "push clip" operation takes the current RGB(A) pixel value and pushes it on the stack, along with a computed alpha value from the clip mask. (The computation of the alpha value is essentiall the same as the "fill" operation). If the blend is isolated, it resets the current RGBA pixel value to all-clear. In any case, the "pop" operation does a Porter-Duff blend of the current pixel value over the pixel value popped from the stack, modulated with the alpha value popped from the stack. (Note: it would also be useful to provide an additional opacity value in the "pop" element, which is simply multiplied by the mask alpha; this unlocks "blend group" functionality, which is not in the current Piet graphics model, but is extremely useful and should be added).

I propose using a windowed stack approach. A small stack window is provided in thread registers, and when it spills it allocates (global) memory and copies the bottom of the stack to that memory. Similarly, popping the stack when the local window is empty, it reads from memory. To allow truly unbounded stack sizes, use a linked list approach so each stack frame in memory contains a pointer (bytebuf offset) to the next frame.

Dynamic allocation of stack frames

All dynamic allocation so far in piet-gpu is done with atomic bump allocation. That's simple and effective, and in the first implementation stage we should just do that. However, with deeply nested clips over large areas, the amount of memory required may be significant; potentially multiples of the render surface. The fine rasterization clip stacks are temporary, so ideally the memory can be freed and then reused.

A general purpose malloc/free allocator running GPU-side would be quite complex. I propose an approach specialized to the expected allocation lifetime patterns for the clip stack. In particular, the allocator state is represented by a bitmap, with one bit for each frame (the frames are all the same size, which is a huge simplification compared with the general malloc case). Then there is a round robin atomic representing the current allocation slot. To allocate, the client does an atomicAdd on the slot atomic, followed by an atomicOr on the bitmap to set the bit. If the bit was already set, retry. Otherwise the allocation has succeeded. Freeing is simple, it's just an atomicAnd to reset the bit.

Do note that correct operation of this allocator requires respecting ordering constraints of the memory model; acquire on allocation and release on free (the increment of the slot atomic can be relaxed). There are similar concerns for the scan (prefix sum) used in element processing, and the code currently uses volatile as a sort of fake memory model to handle this. See Prefix sum on Vulkan for a more detailed discussion. A longer term, proper approach to this is to runtime-query the GPU driver for memory model support and switch to a shader that supports it if available.

Also note that this allocation strategy can deadlock (or perhaps "livelock" would be a better word) if memory is exhausted. There are many similar issues throughout the engine, and a proper approach would involve some combination of accurately estimating resource needs before launching compute kernels, or failing (taking care to free everything that's been allocated) and retrying, perhaps using some kind of scoreboard mechanism to track which tiles have succeeded and which need a retry.

Isolated blending

Above I mentioned the possibility of blending either being isolated or not. In the presence of complex blend modes, there is a (potentially significant) visual difference, and design tools such as Adobe Illustrator expose this choice to the user. It is also provided as an option in the W3C compositing spec.

If only the Porter-Duff "over" blend mode is supported, then in theory isolated and non-isolated blending are identical. One subtle difference is that if only "over" is used, and all blending is non-isolated, and the render target has an opaque background, then it is not actually required to store an alpha channel, and all blending can be done on RGB pixel values. It's not clear the savings are worth it, as memory traffic with chunked pixels is likely to be aligned to 4 byte boundaries, and the ALU savings of not having to blend the alpha channel are minimal.

A more compelling reason to favor non-isolated blending is that it can preserve RGB subpixel text rendering even in the presence of soft clipping and masking. See linebender/piet#289 for more discussion on this point, including significant limitations in Direct2D around subpixel rendering and clipping.

Since we don't currently support complex blend modes, and they're not in the Piet API, a reasonable strategy is to implement only non-isolated blending. When those blend modes are added, that would be a good time to revisit the strategy regarding RGB subpixel text rendering. (It's likely that on Windows and probably Linux as well, we'll want to use the system font renderer for text at normal UI sizes, to match the appearance; this assumption does depend on the intended usage, though).

Conflation artifacts

Unfortunately, soft (antialiased) clipping provides more opportunities to introduce conflation artifacts. In particular, applying the same clip mask twice in nested clips, while in theory idempotent, produces an effect not unlike applying gamma to the mask - in this simple example, it's basically squaring the alpha values.

I'm basically making the assumption that all alpha values should be handled according to Porter-Duff rules, meaning that in the case of nested clips all the alpha values are multiplied. Other renderers (Pathfinder in particular) might use a different heuristic, but I think these other approaches are fragile; for example, they're not appropriate when opacity is set on a blend group.

Most of the academic renderers (MPVG, Li et al, etc) avoid conflation artifacts by computing the equivalent of a higher resolution image with hard path boundaries (including clip masks), then doing reconstruction filtering to achieve antialiasing. Ultimately, I think piet-gpu should have these kinds of rendering modes as well. One consideration is that the memory bandwidth required for the fine rasterization clip/blend stack might increase. Certainly, supporting these modes would be another issue, but I mention them here because there might be an argument for doing something other than Porter-Duff for mask compositing.

Future: GPU-side bbox accounting

Lastly, I've given some thought to the problem of computing bboxes for nested clips. While my blog post on the stack monoid focuses on parsing use cases, I think the central idea (in particular, the use of a windowed stack) would also apply to the stack of bounding boxes needed for nested clips.

I think the main challenge to implementing such a thing efficiently is a sparser approach. Currently, element processing computes the scan of an almost-monoid for each element in the input. That's already a bit wasteful because it's expected that a tiny fraction of the input elements will be transforms. The situation is no doubt considerably worse for handling nested clips, because it's expected the stack monoid operation will be even more expensive than affine transform multiplication (just a bunch of multiply-add operations, a sweet spot for GPU).

Here's the sketch of an idea: the element processing stage doesn't do the clip stack processing, but rather does a filtering stage with the output being a compacted sequence of partially-processed elements. All clip push and pop operations are preserved in this output, and sequences of drawing operations (with no intervening clip stack operations) may be aggregated into a single element with the union of the bboxes. Then, a scan is run on this compacted sequence; a significant fraction of the input will indeed be stack ops, so even if the stack monoid is expensive to compute, its amortized cost will be low.

Obviously, such a thing is tricky to get right, so I think it should be done in a future cycle. But I mention it now because the original vision of piet-gpu doing essentially all the work GPU-side is still valid, and I think it's possible to get there, but realistically only after getting an implementation of the full imaging model (including nested clips) in place using simpler techniques.

@raphlinus
Copy link
Contributor Author

FYI, I am working on this now, and have a local branch with the CPU-side encoding of clips (and transforms, while I'm at it), including bbox tracking, though not yet tested. One question I have is whether to retain the FillMask mechanism, or whether this would fully subsume it.

@eliasnaur
Copy link
Collaborator

Thank you for working on this. I think you should get rid of FillMask; you merged it as a stop-gap.

@raphlinus
Copy link
Contributor Author

Thanks, I'll do that.

As I get into it, I'm running into a somewhat tricky issue. It's important for the BeginClip and EndClip to have exactly the same bbox, so that during coarse rasterization each tile has them properly nesting. My first try was to encode the bbox relative to the current transform, but the problem is that the current transform is not guaranteed to be exactly the same at both points in the sequence, because I "pop" a transform by concatenating its inverse.

I really want to encode things relative to current transform, because that will make it easier to reuse fragments of the scene graph, the idea is that they can be included by appending the encoded bytestring. But I'm also getting a yagni feeling. What I'm inclined to do now for expedience is encode the absolute bbox CPU-side for both begin and end.

Going forward, it is something I would want to fix, and I can think of a few approaches.

The most straightforward (or shall I say least unstraightforward) wrt the current architecture is for the EndClip to refer to the BeginClip, and for element processing of a BeginClip to write out the bbox somewhere (the easiest way, but not especially memory efficient, is to have a buffer with one bbox slot per element in the source scene), then during subsequent processing (binning, coarse raster), get the bbox for the EndClip by following an indirection and reading that.

There are other approaches. If we do a stack monoid for the clip stack, it basically takes care of it. The scan of the monoid would make the bbox available at the EndClip element, and the location to write for the BeginClip would be carried in the stack monoid. The bbox would not be encoded CPU-side at all. It's also the case that managing the transforms with a stack monoid would give you the same transform at both points, but that's not as much of a gain by itself.

I think I need to make a decision that for this cycle I'm not trying to retain scene graph fragments, and there's work to be done in a future cycle after we get this working. It's tempting though, because it feels like most of the pieces in are in place, just not all of them.

raphlinus added a commit that referenced this issue Nov 21, 2020
I realized there's a problem with encoding clip bboxes relative to the
current transform (see #36 for a more detailed explanation), so this is
changing it to absolute bboxes.

This more or less gets clips working. There are optimization
opportunities (all-clear and all-opaque mask tiles), and it doesn't deal
with overflow of the blend stack, but it seems to basically work.
raphlinus added a commit that referenced this issue Nov 27, 2020
Optimize tiles with clip masks that are all-zero or all-one.

Part of #36
eliasnaur added a commit to eliasnaur/piet-gpu that referenced this issue Nov 29, 2020
Obsoleted by BeginClip/EndClip.

Updates linebender#36

Signed-off-by: Elias Naur <[email protected]>
@raphlinus
Copy link
Contributor Author

I'm wondering whether to keep this open, as the feature works. There are two optimizations for future work: one is moving the bbox analysis to the GPU, and the other is doing malloc+free lifetime cycles in the scratch buffer for blending. There are two schools of thought, one is to keep the issue open as a sort of design document until everything is done, the other is to keep a flow, closing the original issue and opening new ones for followup. Most projects seem to gravitate to the latter, but maybe we'll run piet-gpu differently.

@raphlinus
Copy link
Contributor Author

I'm going to close this isssue, as we definitely have nested clipping now, and are exploring a number of different ways of tuning and optimizing. The plan expressed for dynamic memory allocation may be useful to reference but is probably not what we're actually going to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants