Display list serialization / safety. #1800

Closed
glennw opened this issue Oct 3, 2017 · 19 comments

Comments

@glennw
Member

glennw commented Oct 3, 2017

We are experiencing some performance issues with the current display list serialization and deserialization implementation.

We believe that the majority of this is related to codegen issues between serde and rustc. It's possible these will be resolved in the future, but it would be good to have written down what the benefits of the current implementation are. That way, if we consider any changes to the current method, we can ensure we don't miss any of the required functionality.

As an extreme example, consider something similar to the old method, where the display list is a set of flat arrays of data that is simply copied / shared to WR.

The benefits I'm aware of with the current implementation:

  1. bincode gives us compression, reducing the amount of memory that needs to be shared or copied.
  2. We get some level of safety in terms of not needing to validate the display list contents.

The value of (1) seems clear (although we may want to quantify that somehow if we haven't already?). It's less clear to me how (2) works - could we write up an example or two of how this provides safety?

@glennw
Member Author

glennw commented Oct 3, 2017

@gankro opened an issue so we have a single place to discuss DL serialization :)

@glennw
Member Author

glennw commented Oct 3, 2017

cc @jrmuizel

@pcwalton
Contributor

pcwalton commented Oct 4, 2017

Yeah, I was always worried about removing the flat arrays.

I wonder if what we really want is something like Cap'n Proto, which is a zero copy serialization framework. I had looked into it before but had concerns about overengineering things. But if it turns out to improve maintainability, it might be worth it...

@Gankra
Contributor

Gankra commented Oct 4, 2017

Background

As far as correctness is concerned, the largest constraint is that it needs to support being sent over IPC. This means data must be serialized into an array of bytes in some way (no pointers). And for security reasons, if one process is compromised, it shouldn't be able to trigger Undefined Behaviour in the other process. So basically bad DLs should either fail to parse, fail to process, or produce a (probably junk) picture. (gpu attacking content through the DL should be ~impossible regardless of what we do)

Currently all that bincode really buys us here is enum/bool tag validation.
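A tiny sketch of what that tag validation looks like in practice (made-up item type, bincode 1.x + serde derive assumed, not WR code): corrupting the variant tag makes deserialization fail instead of handing back a bogus enum value.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
enum FakeDisplayItem {
    Rect { w: f32, h: f32 }, // variant tag 0
    Line { len: f32 },       // variant tag 1
}

fn main() {
    let bytes = bincode::serialize(&FakeDisplayItem::Line { len: 4.0 }).unwrap();
    assert!(bincode::deserialize::<FakeDisplayItem>(&bytes).is_ok());

    // Corrupt the variant tag (a little-endian u32 at the front in bincode 1.x):
    // the item now fails to parse rather than decoding as garbage.
    let mut bad = bytes.clone();
    bad[0] = 0xff;
    assert!(bincode::deserialize::<FakeDisplayItem>(&bad).is_err());
}
```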

The current design of the display lists is two arrays: DisplayItems followed by a GlyphCache.

The GlyphCache is basically a list of every glyph that will be found somewhere in the DisplayItems (Vec<(Font, Color, Vec<GlyphIndex>)>), so that we can kick off the glyph rasterization task in parallel with processing the DisplayList. This hasn't actually been implemented, so it's currently all overhead to compute it on the client side.

During construction we store the DisplayItems in serialized form, and the GlyphCache as a HashMap<_, HashSet>. The HashMap/Sets can be flattened to Vecs during finalization because the backend just iterates over them to read the contents.
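Roughly, the finalize-time flattening looks like this (stand-in key types, not the real WR ones):

```rust
// Sketch: build-time dedup in a HashMap/HashSet, flattened into plain Vecs at
// finalize() because the backend only ever iterates the result.
use std::collections::{HashMap, HashSet};

type FontKey = u32;
type ColorKey = u32;
type GlyphIndex = u32;

fn flatten_glyph_cache(
    cache: HashMap<(FontKey, ColorKey), HashSet<GlyphIndex>>,
) -> Vec<(FontKey, ColorKey, Vec<GlyphIndex>)> {
    cache
        .into_iter()
        .map(|((font, color), glyphs)| (font, color, glyphs.into_iter().collect()))
        .collect()
}
```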

The reason DisplayItems are always serialized, at this point, is because our current serialization strategy (bincode) significantly compresses them, saving on memory usage (which is ultimately time savings at this scale). Compression is so significant because the SpecificDisplayItem enum is lopsided.

Some DisplayItems also come with auxiliary arrays. For instance, a TextDisplayItem effectively has a Vec<GlyphInstance>. These arrays are serialized either before or after the item, based on how they're built. For the ones before, there are dummy DisplayItems that basically say "hey parse an array of this type, you'll need it for the next item". To the consumer, this is all mostly transparent.
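Very roughly, with a made-up item set, the aux-array trick looks like:

```rust
// Made-up types, just to show the shape of the mechanism.
type GlyphInstance = (u32, f32, f32); // hypothetical (index, x, y)

enum FakeItem {
    // Dummy item: "parse this array now, the next Text item will refer to it".
    SetGlyphs(Vec<GlyphInstance>),
    Text { font: u32 }, // consumes the most recently set glyph array
    Rect { w: f32, h: f32 },
}
```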

During consumption, we deserialize contents "on demand". This avoids allocating an entire array for the deserialized results (which would be huge).

Problems With The Current Design

Bincode is pretty slow. A lot of this is terrible codegen. @jrmuizel is looking into this.

The glyph cache is wasteful in several ways:
* We send glyphs ~twice (more so for CJK than for Latin-based scripts)
* We need to serialize it into the DL's Vec for finalize(), but then this just gets written into the IPC channel.

We copy on "both sides" of the IPC. Copying on the backend is probably necessary for safety. The frontend is more complicated (more on that later).

For auxiliary arrays, we have to jump over and potentially deserialize them, even if we don't access them right away. This is really bad for cache usage.

Potential Design Changes

So there's, in my mind, two big ideas we can look at:

N-Arrays

Rather than having a linear display list, we break the display list into n arrays, one for each kind of DisplayItem (and potentially one for Clips?). The actual display list then becomes a Vec<(arrayNumber, arrayIdx)>, or, to put it another way, SpecificDisplayItem will replace all of its payloads with pointers.

In the backend/ipc-channel, these arrays will all be concatenated into one, with a table of byte offsets at the front.
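A rough sketch of the layout I have in mind (all names made up):

```rust
#[derive(Clone, Copy)]
enum ItemKind {
    Rect = 0,
    Text = 1,
    Clip = 2,
}

struct DisplayListHeader {
    // Byte offset of the start of each per-kind array within `payload`.
    array_offsets: [usize; 3],
}

struct BuiltDisplayList {
    header: DisplayListHeader,
    // The display list proper: which per-kind array, and which slot in it.
    items: Vec<(ItemKind, u32)>,
    // All per-kind arrays concatenated back-to-back.
    payload: Vec<u8>,
}
```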

Some loose notes:

We get some compression "for free" under this system, because some of the lopsidedness is lost. For instance, the Clip array won't have to give each clip space for PushStackingContext's two Matrices. But PushStackingContext's array will still be lopsided because one of those is itself an Option. Those could also be outlined into their own array, but there's probably diminishing returns (or net losses) taking this too far.

We can eliminate serde completely for POD display items. Transmuting their entire array will be safe.
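For example, something along these lines (hand-rolled sketch; only sound for types where every bit pattern is valid and there's no padding, and a real version should lean on a helper crate like bytemuck rather than doing the checks by hand):

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct RectItem {
    x: f32,
    y: f32,
    w: f32,
    h: f32,
}

fn cast_rect_items(bytes: &[u8]) -> Option<&[RectItem]> {
    let size = std::mem::size_of::<RectItem>();
    let align = std::mem::align_of::<RectItem>();
    if bytes.len() % size != 0 || (bytes.as_ptr() as usize) % align != 0 {
        return None;
    }
    // Sound here only because four f32s have no invalid bit patterns and
    // #[repr(C)] fixes the layout with no padding.
    unsafe {
        Some(std::slice::from_raw_parts(
            bytes.as_ptr() as *const RectItem,
            bytes.len() / size,
        ))
    }
}
```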

The GlyphCache doesn't need to exist anymore, as the backend can just jump to the TextDisplayItem array and process that. This saves us some bandwidth.

We don't need to "walk over" auxiliary arrays that we don't want to process right now, as they'll be out of line. However in exchange we may take more cache misses jumping to one of the arrays (but we should generally process the contents of a given array monotonically, so that might be fine?).

We lose some bandwidth recording all the array indices. However, maybe not: if a kind of item isn't multiply referenced (everything but clips?), we can just remember where we last were in that item's array! So basically the backend would have n offsets it maintains as it processes. When it sees "TextDisplayItem"'s tag in the DL, it reads out buf[text_items_offset + text_item_count * text_item_size], and increments text_item_count.
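Sketch of that cursor idea for text items (made-up field names; the backend would keep one of these per item kind):

```rust
struct TextCursor {
    text_items_offset: usize, // where the TextDisplayItem array starts in `buf`
    text_item_size: usize,    // size of one serialized TextDisplayItem
    text_item_count: usize,   // how many we've read so far
}

impl TextCursor {
    // Called each time the main display list stream yields a Text tag.
    fn next<'a>(&mut self, buf: &'a [u8]) -> &'a [u8] {
        let start = self.text_items_offset + self.text_item_count * self.text_item_size;
        self.text_item_count += 1;
        &buf[start..start + self.text_item_size]
    }
}
```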

N-arrays can also be done partially. For instance, only outlining clips or other problematic items.

This whole thing is a lot more of a maintenance burden than the current linear system, though.

IPCBuilder

In theory we could give DisplayListBuilder an actual IPC channel, and have it push/serialize items directly into that. Since the backend needs to copy on its side anyway, and also internally receives the DL in chunks, this would save us a huge allocation and copy on the frontend with little change to the backend.
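Sketch (bincode 1.x assumed, with a plain io::Write standing in for the channel sender):

```rust
use serde::Serialize;
use std::io::Write;

struct StreamingBuilder<W: Write> {
    sink: W, // in reality, something wrapping the ipc-channel sender
}

impl<W: Write> StreamingBuilder<W> {
    fn push_item<T: Serialize>(&mut self, item: &T) -> bincode::Result<()> {
        // Serialize straight into the channel instead of into a big local Vec<u8>.
        bincode::serialize_into(&mut self.sink, item)
    }
}
```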

Copying on the frontend is currently hard to avoid because gecko wants to "hold on" to completed display lists to combine them. It's possible @jrmuizel's work will eliminate this? I'm not sure if there are other complications.

Note that this is slightly incompatible with the N-Arrays approach. Although it may just be as simple as "the backend maintains n arrays". But that might lead to too much IPC traffic, since it will be harder to buffer?

@glennw
Member Author

glennw commented Oct 4, 2017

Thanks for writing this up! It looks like your last sentence in the Background section is incomplete. I'll follow up in detail tomorrow.

@jrmuizel
Collaborator

jrmuizel commented Oct 6, 2017

So I tried out a rustc with rust-lang/rust#45012 and while it may have helped the serialization code a little, the generated code is still bad. Right now it looks like llvm is getting overwhelmed when the structs are big or complicated. I took a quick look at the deserialization code and it still didn't seem great, but I didn't look closely. I'll investigate further.

@jrmuizel
Collaborator

jrmuizel commented Oct 6, 2017

Yeah, it looks like if you're serializing a struct that has two Options in it, llvm gets all confused and generates bad code. I'll file a rust issue about it and maybe @sunfishcode can help.

@pcwalton
Contributor

pcwalton commented Oct 6, 2017

Fascinating. When I looked into this stuff last I pretty quickly ran into fundamental issues around aliasing, which led me to the conclusion that this stuff is better done on MIR. But if there's lower hanging fruit in LLVM that would be great.

@jrmuizel
Collaborator

jrmuizel commented Oct 6, 2017

I filed rust-lang/rust#45068 about the serialization codegen issue.

@jrmuizel
Collaborator

jrmuizel commented Oct 9, 2017

Now that #1830 has landed our serialization code is decent enough.

I've since started looking at the deserialization code. The current serde/bincode approach is going to give llvm a really rough job of generating good code. The generated rust code involves lots of functions returning the values of their fields; because rust doesn't do a good job with return-value optimization (rust-lang/rust#34861), it looks like we end up copying a lot of data around during deserialization. Once the sub-structs get big enough, llvm starts using calls to memcpy and our chances of good performance decrease further. An additional impediment to good codegen will be the branchiness caused by the fine-grained error handling.

I'm going to try to put together a stand-alone test that shows off the bad codegen and we can see what can be done on the rust side. At the same time we probably need to have a solution that doesn't require heroics from the compiler. One can imagine a serde style macro that instead of returning a value would take an &mut and fill in the fields and only return a bool on success or failure. We could pass in a SpecificDisplayItem initialized to mem::uninitialized(). This should be enough to avoid having all values being moved all over the place. @dtolnay how easy would it be to build an unsafe deserializer like this?
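Something like this, very roughly (hand-rolled, made-up item type, not actual serde machinery):

```rust
#[derive(Default)]
struct FakeTextItem {
    font: u32,
    glyph_count: u32,
    origin: [f32; 2],
}

fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let b = buf.get(*pos..*pos + 4)?;
    *pos += 4;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

// The caller owns the storage; we just fill it in and report success/failure,
// so nothing large is built up and returned by value.
fn deserialize_text_in_place(buf: &[u8], pos: &mut usize, out: &mut FakeTextItem) -> bool {
    match (read_u32(buf, pos), read_u32(buf, pos), read_u32(buf, pos), read_u32(buf, pos)) {
        (Some(font), Some(count), Some(x), Some(y)) => {
            out.font = font;
            out.glyph_count = count;
            out.origin = [f32::from_bits(x), f32::from_bits(y)];
            true
        }
        _ => false,
    }
}
```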

@glennw
Member Author

glennw commented Oct 9, 2017

@jrmuizel Thanks for following up on this stuff. I tend to agree with you in regards to the compiler heroics.

From talking to @gankro, the safety guarantees we get from using serde/bincode are that we know that enums / bools are valid values (although they could be modified to a different valid enum value, without detection) and perhaps some extra level of bounds checking. There's nothing that gets validated for ints / floats, since any bit pattern in those is valid. There are, of course, other benefits we get too (such as compression).

The other extreme from using serde / bincode seems to be a straight copy / transmute of the DL, and we manually write (unsafe) validation helpers for each struct type in the DL that check the validity of each field type. I think we could reasonably do this validation while we walk the DL during the flatten stage. It should be fairly easy to make this close to optimal in terms of performance while allowing validation?
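As a sketch of what one of those helpers might look like (made-up item type and tag count):

```rust
// Only fields with invalid bit patterns need checking; ints/floats don't.
#[repr(C)]
struct RawBorderItem {
    style_tag: u32,   // must name a real border style
    do_aa: u8,        // must be 0 or 1 to be a valid bool
    _pad: [u8; 3],
    widths: [f32; 4], // any bit pattern is fine, no check needed
}

const BORDER_STYLE_COUNT: u32 = 3; // hypothetical: Solid, Dotted, Dashed

fn validate_border_item(item: &RawBorderItem) -> bool {
    item.style_tag < BORDER_STYLE_COUNT && item.do_aa <= 1
}
```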

I know it's far from ideal (having to manually audit the validation code, and keep it up to date), but it seems like it could work, as a worst case, if we can't sort out these codegen issues?

@dtolnay

dtolnay commented Oct 9, 2017

One can imagine a serde style macro that instead of returning a value would take an &mut and fill in the fields and only return a bool on success or failure.

Serde is interested in supporting this style -- tracked in serde-rs/serde#855. We had been looking at it for the csv crate, not to avoid memcpy of the return values but to reuse memory allocations across rows deserialized from the same csv file. The return value is Result<(), D::Error> which for Bincode is 8 bytes so it should be just as efficient as a bool. Does this seem like it would fit the bill?

@Gankra
Contributor

Gankra commented Oct 9, 2017 via email

@jrmuizel
Collaborator

jrmuizel commented Oct 9, 2017

I've filed bincode-org/bincode#206 to start investigating bincode deserialization performance.

@nical
Contributor

nical commented Oct 9, 2017

[...] and we manually write (unsafe) validation helpers for each struct type in the DL that check the validity of each field type.

I have zero confidence in our ability to manually maintain the safety of this over the years to come. Let's try everything else that we can first.

@jrmuizel
Collaborator

So both serialization and deserialization no longer have atrocious code. However, looking at profiles and generated code, it looks like there's still room for improvement. I've filed https://bugs.llvm.org/show_bug.cgi?id=34911 to help reduce some of the length checks.

@mrobinson
Member

Splitting out the ClipScrollNodes to their own array makes a lot of sense to me. That means that the entire ClipScrollTree could be built in one pass and then the rest of the display list later.

@glennw
Member Author

glennw commented Dec 14, 2017

@gankro @jrmuizel I think this could probably be closed now, in favor of opening specific issues for anything else that pops up here?

@glennw
Member Author

glennw commented Dec 19, 2017

I think this can be closed now, in favor of more specific issues if there is remaining work. Please re-open if you think otherwise.

@glennw glennw closed this as completed Dec 19, 2017