Stroke rework #303
I have a potential refinement to the design after having written this issue. The goal is to reuse some of the work from path segment count so it doesn't need to be redone in path segment tiling. That work is atomic counts (with a scattered memory access pattern), plus prefix sums of counts followed by binary search based load balancing, which is potentially expensive. In this refinement, there is one more buffer, sized exactly the same as the segment buffer, written by path segment count and read by path segment tiling. Each element has three fields:
The latter two counts can reasonably be packed into u16. (Note: an adversarial input might overflow 64k path segments per tile; we should probably be robust in the sense of not crashing, but probably accept incorrect rendering for that tile) Then, path segment tiling is an indirect dispatch, with a thread count equal to the number of segments. Each thread processes exactly one tile. That processing is:
Thus, the memory access patterns are: contiguous reads of the new count struct, gathered reads of line soup and the tile allocation data, completed by a scattered write. There's no shared memory (or inter-thread communication of any kind) or concern about utilization. On the other hand, there's the additional read/write traffic to the count struct compared with the original plan. I expect the refinement to be considerably faster, but of course it's hard to say for sure without experimentation.
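As a sketch of the robustness point above, the two u16 counts can be packed into a single u32 with saturation rather than wrapping, so an adversarial input with more than 64k segments in a tile can corrupt only that tile's rendering, never a neighboring field. The field names here are hypothetical, not the actual buffer layout:

```rust
/// Pack two per-tile counts into one u32 as two u16 fields, saturating at
/// 0xffff. Field names and layout are illustrative, not the real encoding.
fn pack_counts(count_lo: u32, count_hi: u32) -> u32 {
    let lo = count_lo.min(0xffff); // saturate instead of wrapping
    let hi = count_hi.min(0xffff);
    (hi << 16) | lo
}

fn unpack_counts(packed: u32) -> (u32, u32) {
    (packed & 0xffff, packed >> 16)
}
```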
Another potential optimization (as I'm implementing this): the "segment count" data structure might represent a small number (2 or possibly 4) of tiles. The indices of the segments within the line would be sequential, so they wouldn't need to be stored (though there would need to be accounting for partially filled slices). The atomic increments would need to be stored, at a cost of one u16 each. Whether this is a win or a loss depends greatly on the distribution of the number of tiles per line segment. It's a loss if most of the lines are one tile, and I suspect that is likely in many cases (especially font rendering). But it's potentially worth considering.
I'm obviously overthinking the optimization thing. A bit of a brain dump. You know the count (number of tiles covered) during flattening, almost. The "almost" part is that you do know when the line is clipped to the viewport, but not when the bounding box is intersected by clipping. That's unlikely to be a significant factor in real life. Also keep in mind that retaining line segments to the left of the viewport (i.e. when the y range overlaps but the xmax of the segment is less than the xmin of the viewport) is important for coarse winding computation. When the count is 0, the line should be discarded on flatten, so it has zero cost downstream. This will be important when culling large parts of the drawing area (i.e. zoomed-in SVGs) but not when the application is optimizing while encoding.

The following discussion pretty much only makes sense when n is large (when n is small, the savings will probably be dominated by the extra dispatches). It's expected that a significant fraction of line segments will touch only one tile. Segregate the "line soup" structure so one buffer is for single-tile lines and the other is for the rest. Flattening will scatter using two atomic bump allocators. Then you have two copies of the path count and path tiling stages, specialized for count = 1.

The single-tile path count stage is considerably simpler, as there's no iteration over the number of tiles, and thus no tradeoff between the simple-iteration and load-balanced strategies; you don't get any occupancy problems doing the former, and you don't have to pay for the partition-wide prefix sum or binary search of the latter. In addition, you don't have to set up the math for conservative line rasterization (it's at least one divide). The single-tile path count stage also doesn't bump an atomic for coarse winding (backdrop), nor is there any clipping to tile boundaries. The path tiling stage is also specialized.
The single-tile version can avoid some of the line rasterization logic. However, the real benefit is to enable packing more than one segment into the segment count structure, as described in the previous comment. Then you don't pay any cost for reduced utilization in the 1-tile case. It's completely unclear how much of a win the above is, or even whether path tiling will be enough of the total runtime to be worth optimizing so carefully. As usual, the only real thing to be done is implement and measure.
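A toy version of the count computation that would route a line into the single-tile or multi-tile soup: the number of tiles a segment touches is one (its starting tile) plus one per horizontal or vertical tile-boundary crossing. This sketch ignores viewport clipping and the convention for endpoints exactly on a tile boundary:

```rust
const TILE_W: f32 = 16.0;
const TILE_H: f32 = 16.0;

/// Toy count of tiles touched by a line segment: the starting tile plus one
/// per tile boundary crossed in x or y. Ignores viewport clipping and
/// on-boundary endpoint conventions.
fn tile_count(x0: f32, y0: f32, x1: f32, y1: f32) -> u32 {
    let tx = ((x1 / TILE_W).floor() - (x0 / TILE_W).floor()).abs() as u32;
    let ty = ((y1 / TILE_H).floor() - (y0 / TILE_H).floor()).abs() as u32;
    1 + tx + ty
}
```

A flattener implementing the segregation above would bump one of two atomic allocators depending on whether `tile_count(..) == 1`.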
Don't use GPU-side signed distance field rendering for strokes, but rather expand them on the CPU (using kurbo) during encoding, then render using fill. This is a significant performance regression, but it is on the critical path for multisampled path rendering and stroke rework (#303).
One step in moving stroke expansion to the GPU is encoding the stroke style and providing enough context to do joins and caps. The old signed distance field based renderer did not need context for joins, as it always modeled each line stroke as a stadium. There are two parts: encoding the style and encoding the path.

For style encoding, we plan to continue to do dashing CPU-side. One reason is that dashing is invariant to transforms, while efficient stroke expansion is sensitive to the transform in order to set the tolerance correctly. Thus, what needs to be encoded is a flag to indicate stroke vs fill, additional flags for the join and cap types, the miter limit, and the line width (which is currently included in the encoding). Additionally, it makes sense for even/odd vs nonzero to be encoded as a flag (these are currently "magic" negative values of line width). I propose:

```rust
struct Stroke {
    flags_and_miter: u32,
    line_width: f32,
}
```

The miter limit would be encoded as f16 within the first word.

Encoding the path for strokes is nontrivial, as rendering of a stroke segment requires knowledge of whether it is the first and/or last segment of its subpath (to conditionally render caps), and also of one adjacent segment (to render joins). Fortunately, I think a relatively simple extension to the existing encoding can handle these cases, at the modest cost of one additional encoded path segment per subpath: conceptually a repetition of the first segment in the subpath, though to save space a full repetition need not be encoded. This final segment has the subpath end bit set.

Fills are unchanged (other than minor changes in detecting stroke vs fill and the fill rule). In particular, the planned optimization of eliding the subpath end bit when a subsequent subpath has an implicit moveto is still viable. This optimization is not available for strokes, however.

For strokes, the two parallel curves are rendered when the subpath end bit is false. In addition, the path tag of the following path segment is decoded. If it has the subpath end bit set and the subpath is open, then an end cap is drawn for the end of the path segment. Otherwise, the subsequent path segment is decoded, its start tangent is calculated, and a join is drawn based on the end tangent of this segment and the start tangent of the next. When the subpath end bit is true and the subpath is open, a start cap is drawn. If the subpath is closed, nothing is drawn (the segment must still be encoded to provide a join tangent for the previous segment).

I believe this handles the needed cases and can be evaluated fully in parallel. As such, I think it represents a middle point between the concepts of 'local' and 'global' as in Nehab 2020. By the classification of that paper, it is technically a 'global' technique, yet each segment can be processed independently and there is no need to maintain ordering of the output segments.

It might be worth testing this with a placeholder algorithm, leveraging the existing flattening logic, to get it working end-to-end, as the anticipated Euler spiral based stroke expansion algorithm depends on some fancy computational geometry.
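A sketch of the proposed two-word style encoding, with a hypothetical flag layout (the actual bit assignments are not specified in this issue) and a deliberately minimal f32-to-f16 conversion for the miter limit:

```rust
/// Minimal f32 -> f16 bit conversion: truncating, normal range only. A real
/// implementation would handle rounding, subnormals, and NaN payloads.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32 - 127 + 15; // rebias exponent
    let mant = ((bits >> 13) & 0x3ff) as u16;          // top 10 mantissa bits
    if exp <= 0 {
        return sign; // flush small values to zero
    }
    if exp >= 31 {
        return sign | 0x7c00; // overflow to infinity
    }
    sign | ((exp as u16) << 10) | mant
}

/// Encode the proposed Stroke struct as two words. Low 16 bits of the first
/// word hold stroke/fill, join, cap, and fill-rule flags (bit assignments
/// hypothetical); high 16 bits hold the miter limit as f16.
fn encode_stroke(flags: u32, miter_limit: f32, line_width: f32) -> [u32; 2] {
    [
        (flags & 0xffff) | ((f32_to_f16_bits(miter_limit) as u32) << 16),
        line_width.to_bits(),
    ]
}
```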
This is part of the larger multisampled path rendering work, under stroke rework (#303). It refactors the GPU pipeline so that the path segments available to fine rasterization are stored as a contiguous slice rather than a linked list as before. Numerous parts of the pipeline are refactored. In the old pipeline, path segment decoding generated cubic line segments and also estimated a (somewhat imprecise) bounding box, and the combination of flattening those cubics and tiling happened in a separate stage (path_coarse) quite a bit later in the pipeline. In the new pipeline, path decoding is fused with flattening, generating a `LineSoup` structure (line segments associated with paths, otherwise unordered), with the bounding box computed as a side effect; tiling is spread over multiple stages later in the pipeline. The first tiling stage (path_count) counts the number of tiles that will be generated. Then coarse rasterization allocates contiguous slices based on those counts. The second stage does a scattered write of the resulting tiles. Both of these stages rely on indirect dispatch, as the number of lines and the number of segments (respectively) are not known at encode time. These changes only make sense for filled paths, so they rely on stroke expansion being done earlier, currently on the CPU.
* Pipelines downstream of flatten (draw_leaf, coarse) now extract the fill rule using a "draw flags" field. Flatten still writes this to the path bounding-box structure, which gets propagated to the draw info list, and eventually translates to the fill rule written to the PTCL.
* draw_leaf no longer deals with transforming the linewidth for strokes. This was a leftover from the previous architecture and the logic is no longer needed.
* bbox_clear used to contain a duplicate PathBbox data structure. It now uses the one from shader/shared/bbox.wgsl.

This continues the work outlined in #303.
A small question: you moved part of the work from the GPU to the CPU; is this fast enough? Sorry, I haven't looked into kurbo carefully, but if there are countless dashed lines with some style, does this process do the clipping beforehand, or does it have to calculate all of them even if they are not inside the camera or window? Also, I ran into a problem with line width: we have a rather awkward requirement of keeping the line width static relative to the screen, which means that in the old way I just needed to change the line width according to the zoom-in/out scale once, but now I have to update all the strokes on the CPU. This makes the process a very hot path. -_-!!!! My first thought is to split this work between CPU and GPU and run them in parallel. My concern is that the CPU side may slow down the whole process. Any ideas? Thanks a lot.
- Implemented the stroke cap marker segment encoding scheme outlined in #303. The proposed scheme has been slightly modified: instead of an open/closed bit, the marker segment is encoded as a lineto or quadto to distinguish between the two states. Encoding the marker segment as a quadto for open paths also handles a limitation of using only linetos: for an open path, the marker must encode both the start tangent and the coordinates of the first control point to correctly position a start cap. Encoding these as 4 f32s would require an implicit moveto before the lineto, which would have to terminate the preceding subpath. Encoding this as a quadto preserves the invariant that every stroked subpath contains exactly one segment with the SUBPATH_END_BIT set to 1, which now serves as the stroke cap marker. For a closed path, a lineto is sufficient, since the preceding coordinate (from the lineto encoded for a "closepath" verb) must already equal the subpath's first control point.
- "Shitty stroking" continues to work by detecting and ignoring the stroke cap marker.
Following the recent pipeline refactor, all stroke expansion is currently done on the CPU. This is of course very slow if the scene contains a lot of stroked path segments. It is temporary, as the end goal is to do all stroke expansion on the GPU. There has been steady progress towards building out the GPU stroker (see #408, which is the next step towards this), and my local prototype is already showing promising performance numbers on the heaviest stroke test scenes.
That's an interesting point about clipping. I don't believe the kurbo dash iterator is aware of the viewport or does any type of culling. This could be a good avenue to explore if we end up living in the CPU dashing world long enough.
Others can correct me but I believe this is generally the case with kurbo shapes and Vello currently doesn't support setting a device-space line width. If you want the stroke width to stay fixed relative to the viewport scale, you need to re-encode the stroke styles.
Would you clarify what you're asking? Are you looking for a way to process your stroke encoding in parallel to be consumed by Vello, or are you asking generally?
Thanks a lot for your answer; this helps. By splitting the work and running it in parallel, I just meant that there is some kind of work, such as dashing or filling a rect with some pattern, which I think is too time-consuming to do on the CPU. Thanks!
I'm going to close this issue, as all of the pieces have now landed in main. No doubt there's more to do, but I think we can consider strokes to be reworked.
We have quite a bit of work planned on redoing strokes to make them richer and more correct, while retaining performance. There's a considerable amount of discussion in the Zulip thread, but I also wanted to snapshot the current state of the plans here.
At a very high level, the plan is to compute the offset curve plus flattening in a compute shader, expanding the stroke into a fill, then use the same mechanisms downstream to render the fills. This will replace the current distance-field approach. The latter has some nice properties, but it also has limitations, including not handling stroke styles other than round-cap/round-join, and not handling general (non-angle-preserving) affine transformations.
The following capabilities are in scope for the compute shader:
The following capabilities are, for the moment, out of scope:
Dashing will be computed during encoding, and will be handled by the inv_arclen method.
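A toy model of the inverse-arclength computation that dashing relies on (kurbo's `inv_arclen` is analogous but operates on actual curve segments): given a monotone arclength function on t in [0, 1], bisect for the parameter at which a given dash boundary falls. The closure-based signature here is an illustrative stand-in, not kurbo's API:

```rust
/// Find t in [0, 1] such that arclen(t) ≈ s, by bisection. Assumes arclen is
/// monotone with arclen(0) = 0. A dash pattern is applied by solving this at
/// each cumulative on/off length and splitting the curve at the resulting t.
fn inv_arclen(arclen: impl Fn(f64) -> f64, s: f64, accuracy: f64) -> f64 {
    let (mut lo, mut hi) = (0.0_f64, 1.0_f64);
    while hi - lo > accuracy {
        let mid = 0.5 * (lo + hi);
        if arclen(mid) < s {
            lo = mid; // target arclength lies in the upper half
        } else {
            hi = mid;
        }
    }
    0.5 * (lo + hi)
}
```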
Evolutes are part of an extremely rigorous approach to spec compliance, where the marked area is defined as a line swept along the source curve, normal to the curve. Very few renderers correctly implement them, and they come at a complexity and performance cost. A serious attempt will be made to ensure results equivalent to a Minkowski sum of the path and the disc (or ellipse, under affine transform) of the stroking pen, when the stroke style is round-cap, round-join, but not an absolute guarantee for all adversarial inputs.
The main flattening method will be a variant of the fractional subdivision method currently used, but with Euler spirals as the base curve rather than quadratic Beziers. Offset curve will be done Euler spiral to Euler spiral, using the algorithms in Cleaner parallel curves with Euler spirals. The main motivation for Euler spiral is to make the cusp detection and subdivision simple, as this is a tricky recursive subdivision with Bezier-based representations. Another advantage of Euler spiral is that the round caps and joins have an exact representation. Cubic-to-cubic offsetting may also bring challenging issues with numerical stability of the quartic solver; it's only been tested with f64 arithmetic.
The proposed pipeline is:

1. Path tag scan, producing the TagMonoid (scan of 4x reduction).
2. Flattening of path segments into the "line soup" buffer, with path bounding boxes as a side effect.
3. Path segment counting (number of tiles per line).
4. Coarse rasterization, allocating contiguous slices of Tile structs.
5. Path segment tiling, writing segments into those slices.

See #259 for the monoid-based bounding box computation, which might land separately before this. "Line soup" refers to an unsorted buffer of path id and line segment (f32, render target coordinate system). The output line segments will most likely be f16, with coordinates relative to the tile. Additionally, numerical robustness work in multisampled path rendering will likely lead to a replacement of y_edge (currently an additional f32) with a test that x0 or x1 = 0. All this (including removal of the linked list reference) should reduce the bandwidth and memory consumption of the line segments from 24 to 8 bytes per segment.

Generating the path segments contiguously rather than in a linked list also helps with multisampled path rendering (#270). Depending on the details, that work may involve other dispatches, especially computing partition-wide prefix sums of the pixel counts for each line segment. That's the reason the above pipeline has been revised slightly to output line segments into their own buffer rather than inline in the ptcl. (Having subgroups would probably also solve the performance issue, but again details depend on empirical measurement and tuning.)
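A sketch of the 8-byte segment representation: four tile-relative coordinates stored as f16 (2 bytes each), packed into two u32 words. The f32-to-f16 conversion below is deliberately minimal (truncating, normals only) and just for illustration:

```rust
/// Minimal f32 -> f16 bit conversion: truncating, normal range only.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32 - 127 + 15; // rebias exponent
    let mant = ((bits >> 13) & 0x3ff) as u16;          // top 10 mantissa bits
    if exp <= 0 {
        return sign; // flush small values to zero
    }
    if exp >= 31 {
        return sign | 0x7c00; // overflow to infinity
    }
    sign | ((exp as u16) << 10) | mant
}

/// Pack a tile-local segment (coordinates relative to the tile origin, so
/// they fit in f16 precision) into 8 bytes: two f16 pairs per word.
fn pack_segment(x0: f32, y0: f32, x1: f32, y1: f32) -> [u32; 2] {
    let pair =
        |a: f32, b: f32| (f32_to_f16_bits(a) as u32) | ((f32_to_f16_bits(b) as u32) << 16);
    [pair(x0, y0), pair(x1, y1)]
}
```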
A potentially major improvement to the backdrop computation is doing partition-wide prefix sum (oblivious to which path the backdrops belong to), then having a bit of logic in coarse to fix those up when crossing partition boundaries. In the non-crossing case, the resolved backdrop is the backdrop of the tile minus the backdrop at the start of the row. The crossing case is similar but adding the backdrop at the end of the partition.
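The fixup idea can be modeled sequentially. This toy (one tile row of backdrop deltas, split into fixed-size partitions, a layout chosen for illustration) shows that per-partition prefix sums plus a carried partition total reproduce the row-wide prefix sum:

```rust
/// Toy model of the backdrop fixup: each partition independently computes an
/// inclusive prefix sum of its backdrop deltas (the cheap, path-oblivious GPU
/// step); the fixup then adds the carry, i.e. the running total of all
/// preceding partitions in the same row.
fn resolve_row(deltas: &[i32], partition_size: usize) -> Vec<i32> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut carry = 0;
    for chunk in deltas.chunks(partition_size) {
        let mut sum = 0;
        for &d in chunk {
            sum += d;
            out.push(carry + sum); // resolved backdrop for this tile
        }
        carry += sum; // partition total carried across the boundary
    }
    out
}
```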
A possibility to explore is that the path segment count phase also outputs partition-wide prefix sums of the counts, for consumption by the path segment tiling phase.
There's a bit more computational geometry to do, notably detecting when the affine transform is not angle preserving (doing a singular value decomposition is probably the gold-plated version of this, though I think a quick check is possible) and computing an error metric on transformations.
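One form the quick check could take, assuming the linear part of the transform is the 2x2 matrix [[a, b], [c, d]]: the map preserves angles exactly when its columns are orthogonal and of equal length (i.e. it is a scaled rotation or scaled reflection), so no full SVD is needed:

```rust
/// Quick angle-preservation test for the 2x2 matrix [[a, b], [c, d]]:
/// conformal iff the columns (a, c) and (b, d) are orthogonal and of equal
/// length, up to tolerance eps (which should scale with the matrix norm).
fn is_angle_preserving(a: f64, b: f64, c: f64, d: f64, eps: f64) -> bool {
    let dot = a * b + c * d;                          // column orthogonality
    let len_diff = (a * a + c * c) - (b * b + d * d); // equal column lengths
    dot.abs() <= eps && len_diff.abs() <= eps
}
```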
Perspective transforms may be in scope, as they are required by CSS. Additionally, rational Beziers may also be in scope, as they're essentially what you get when you apply perspective transforms to Beziers. Care must be taken not to significantly slow the common case though.
The work will probably involve building CPU-side implementations of the main primitives. That will be quite useful in figuring out numerical robustness, and probably for other reasons.