Display list serialization / safety. #1800

Closed
glennw opened this issue Oct 3, 2017 · 19 comments

Comments

@glennw
Member

glennw commented Oct 3, 2017

We are experiencing some performance issues with the current display list serialization and deserialization implementation.

We believe that the majority of this is related to codegen issues between serde and rustc. It's possible these will be resolved in the future, but it would be good to have written down what the benefits of the current implementation are. That way, if we consider any changes to the current method, we can ensure we don't miss any of the required functionality.

As an extreme example, consider something similar to the old method, where the display list is a set of flat arrays of data that is simply copied / shared to WR.

The benefits I'm aware of with the current implementation:

  1. bincode gives us compression, reducing the amount of memory that needs to be shared or copied.
  2. We get some level of safety in terms of not needing to validate the display list contents.

The value of (1) seems clear (although we may want to quantify that somehow if we haven't already?). It's less clear to me how (2) works - could we write up an example or two of how this provides safety?

@glennw
Member Author

glennw commented Oct 3, 2017

@gankro opened an issue so we have a single place to discuss DL serialization :)

@glennw
Member Author

glennw commented Oct 3, 2017

cc @jrmuizel

@pcwalton
Contributor

pcwalton commented Oct 4, 2017

Yeah, I was always worried about removing the flat arrays.

I wonder if what we really want is something like Cap'n Proto, which is a zero copy serialization framework. I had looked into it before but had concerns about overengineering things. But if it turns out to improve maintainability, it might be worth it...

@Gankra
Contributor

Gankra commented Oct 4, 2017

Background

As far as correctness is concerned, the largest constraint is that it needs to support being sent over IPC. This means data must be serialized into an array of bytes in some way (no pointers). And for security reasons, if one process is compromised, it shouldn't be able to trigger Undefined Behaviour in the other process. So basically bad DLs should either fail to parse, fail to process, or produce a (probably junk) picture. (gpu attacking content through the DL should be ~impossible regardless of what we do)

Currently all that bincode really buys us here is enum/bool tag validation.
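A tiny sketch of what that tag validation looks like in practice (made-up item type, bincode 1.x + serde derive assumed, not WR code): corrupting the variant tag makes deserialization fail instead of handing back a bogus enum value.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
enum FakeDisplayItem {
    Rect { w: f32, h: f32 }, // variant tag 0
    Line { len: f32 },       // variant tag 1
}

fn main() {
    let bytes = bincode::serialize(&FakeDisplayItem::Line { len: 4.0 }).unwrap();
    assert!(bincode::deserialize::<FakeDisplayItem>(&bytes).is_ok());

    // Corrupt the variant tag (a little-endian u32 at the front in bincode 1.x):
    // the item now fails to parse rather than decoding as garbage.
    let mut bad = bytes.clone();
    bad[0] = 0xff;
    assert!(bincode::deserialize::<FakeDisplayItem>(&bad).is_err());
}
```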

The current design of the display lists is two arrays: DisplayItems followed by a GlyphCache.

The GlyphCache is basically a list of every glyph that will be found somewhere in the DisplayItems (Vec<(Font, Color, Vec<GlyphIndex>)>), so that we can kick off the glyph rasterization task in parallel with processing the DisplayList. This hasn't actually been implemented, so it's currently all overhead to compute it on the client side.

During construction we store the DisplayItems in serialized form, and the GlyphCache as a HashMap<_, HashSet>. The HashMap/Sets can be flattened to Vecs during finalization because the backend just iterates over them to read the contents.
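Roughly, the finalize-time flattening looks like this (stand-in key types, not the real WR ones):

```rust
// Sketch: build-time dedup in a HashMap/HashSet, flattened into plain Vecs at
// finalize() because the backend only ever iterates the result.
use std::collections::{HashMap, HashSet};

type FontKey = u32;
type ColorKey = u32;
type GlyphIndex = u32;

fn flatten_glyph_cache(
    cache: HashMap<(FontKey, ColorKey), HashSet<GlyphIndex>>,
) -> Vec<(FontKey, ColorKey, Vec<GlyphIndex>)> {
    cache
        .into_iter()
        .map(|((font, color), glyphs)| (font, color, glyphs.into_iter().collect()))
        .collect()
}
```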

The reason DisplayItems are always serialized, at this point, is because our current serialization strategy (bincode) significantly compresses them, saving on memory usage (which is ultimately time savings at this scale). Compression is so significant because the SpecificDisplayItem enum is lopsided.

Some DisplayItems also come with auxiliary arrays. For instance, a TextDisplayItem effectively has a Vec<GlyphInstance>. These arrays are serialized either before or after the item, based on how they're built. For the ones before, there are dummy DisplayItems that basically say "hey parse an array of this type, you'll need it for the next item". To the consumer, this is all mostly transparent.
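Very roughly, with a made-up item set, the aux-array trick looks like:

```rust
// Made-up types, just to show the shape of the mechanism.
type GlyphInstance = (u32, f32, f32); // hypothetical (index, x, y)

enum FakeItem {
    // Dummy item: "parse this array now, the next Text item will refer to it".
    SetGlyphs(Vec<GlyphInstance>),
    Text { font: u32 }, // consumes the most recently set glyph array
    Rect { w: f32, h: f32 },
}
```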

During consumption, we deserialize contents "on demand". This avoids allocating an entire array for the deserialized results (which would be huge).

Problems With The Current Design

Bincode is pretty slow. A lot of this is terrible codegen. @jrmuizel is looking into this.

The glyph cache is wasteful in several ways:
* We send glyphs ~twice (more so for CJK than for Latin-based scripts)
* We need to serialize it into the DL's Vec for finalize(), but then this just gets written into the IPC channel.

We copy on "both sides" of the IPC. Copying on the backend is probably necessary for safety. The frontend is more complicated (more on that later).

For auxiliary arrays, we have to jump over and potentially deserialize them, even if we don't access them right away. This is really bad for cache usage.

Potential Design Changes

So there's, in my mind, two big ideas we can look at:

N-Arrays

Rather than having a linear display list, we break the display list into n arrays, one for each kind of DisplayItem (and potentially one for Clips?). The actual display list then becomes a Vec<(arrayNumber, arrayIdx)>, or, to put it another way, SpecificDisplayItem will replace all of its payloads with pointers.

In the backend/ipc-channel, these arrays will all be concatenated into one, with a table of byte offsets at the front.
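A rough sketch of the layout I have in mind (all names made up):

```rust
#[derive(Clone, Copy)]
enum ItemKind {
    Rect = 0,
    Text = 1,
    Clip = 2,
}

struct DisplayListHeader {
    // Byte offset of the start of each per-kind array within `payload`.
    array_offsets: [usize; 3],
}

struct BuiltDisplayList {
    header: DisplayListHeader,
    // The display list proper: which per-kind array, and which slot in it.
    items: Vec<(ItemKind, u32)>,
    // All per-kind arrays concatenated back-to-back.
    payload: Vec<u8>,
}
```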

Some loose notes:

We get some compression "for free" under this system, because some of the lopsidedness is lost. For instance, the Clip array won't have to give each clip space for PushStackingContext's two Matrices. But PushStackingContext's array will still be lopsided because one of those is itself an Option. Those could also be outlined into their own array, but there's probably diminishing returns (or net losses) taking this too far.

We can eliminate serde completely for POD display items. Transmuting their entire array will be safe.
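For example, something along these lines (hand-rolled sketch; only sound for types where every bit pattern is valid and there's no padding, and a real version should lean on a helper crate like bytemuck rather than doing the checks by hand):

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct RectItem {
    x: f32,
    y: f32,
    w: f32,
    h: f32,
}

fn cast_rect_items(bytes: &[u8]) -> Option<&[RectItem]> {
    let size = std::mem::size_of::<RectItem>();
    let align = std::mem::align_of::<RectItem>();
    if bytes.len() % size != 0 || (bytes.as_ptr() as usize) % align != 0 {
        return None;
    }
    // Sound here only because four f32s have no invalid bit patterns and
    // #[repr(C)] fixes the layout with no padding.
    unsafe {
        Some(std::slice::from_raw_parts(
            bytes.as_ptr() as *const RectItem,
            bytes.len() / size,
        ))
    }
}
```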

The GlyphCache doesn't need to exist anymore, as the backend can just jump to the TextDisplayItem array and process that. This saves us some bandwidth.

We don't need to "walk over" auxiliary arrays that we don't want to process right now, as they'll be out of line. However in exchange we may take more cache misses jumping to one of the arrays (but we should generally process the contents of a given array monotonically, so that might be fine?).

We lose some bandwidth recording all the array indices. However, maybe not: if a kind of item isn't multiply referenced (everything but clips?), we can just remember where we last were in that item's array! So basically the backend would have n offsets it maintains as it processes. When it sees "TextDisplayItem"'s tag in the DL, it reads out buf[text_items_offset + text_item_count * text_item_size], and increments text_item_count.
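Sketch of that cursor idea for text items (made-up field names; the backend would keep one of these per item kind):

```rust
struct TextCursor {
    text_items_offset: usize, // where the TextDisplayItem array starts in `buf`
    text_item_size: usize,    // size of one serialized TextDisplayItem
    text_item_count: usize,   // how many we've read so far
}

impl TextCursor {
    // Called each time the main display list stream yields a Text tag.
    fn next<'a>(&mut self, buf: &'a [u8]) -> &'a [u8] {
        let start = self.text_items_offset + self.text_item_count * self.text_item_size;
        self.text_item_count += 1;
        &buf[start..start + self.text_item_size]
    }
}
```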

N-arrays can also be done partially. For instance, only outlining clips or other problematic items.

This whole thing is a lot more of a maintenance burden than the current linear system, though.

IPCBuilder

In theory we could give DisplayListBuilder an actual IPC channel, and have it push/serialize items directly into that. Since the backend needs to copy on its side anyway, and also internally receives the DL in chunks, this would save us a huge allocation and copy on the frontend with little change to the backend.
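Sketch (bincode 1.x assumed, with a plain io::Write standing in for the channel sender):

```rust
use serde::Serialize;
use std::io::Write;

struct StreamingBuilder<W: Write> {
    sink: W, // in reality, something wrapping the ipc-channel sender
}

impl<W: Write> StreamingBuilder<W> {
    fn push_item<T: Serialize>(&mut self, item: &T) -> bincode::Result<()> {
        // Serialize straight into the channel instead of into a big local Vec<u8>.
        bincode::serialize_into(&mut self.sink, item)
    }
}
```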

Copying on the frontend is currently hard to avoid because gecko wants to "hold on" to completed display lists to combine them. It's possible @jrmuizel's work will eliminate this? I'm not sure if there are other complications.

Note that this is slightly incompatible with the N-Arrays approach. Although it may just be as simple as "the backend maintains n arrays". But that might lead to too much IPC traffic, since it will be harder to buffer?

@glennw
Member Author

glennw commented Oct 4, 2017

Thanks for writing this up! It looks like your last sentence in the Background section is incomplete. I'll follow up in detail tomorrow.

@jrmuizel
Collaborator

jrmuizel commented Oct 6, 2017

So I tried out a rustc with rust-lang/rust#45012 and while it may have helped the serialization code a little, the generated code is still bad. Right now it looks like llvm is getting overwhelmed when the structs are big or complicated. I took a quick look at the deserialization code and it still didn't seem great, but I didn't look closely. I'll investigate further.

@jrmuizel
Collaborator

jrmuizel commented Oct 6, 2017

Yeah, it looks like if you're serializing a struct that has two Options in it, llvm gets all confused and generates bad code. I'll file a rust issue about it and maybe @sunfishcode can help.

@pcwalton
Contributor

pcwalton commented Oct 6, 2017

Fascinating. When I looked into this stuff last I pretty quickly ran into fundamental issues around aliasing, which led me to the conclusion that this stuff is better done on MIR. But if there's lower hanging fruit in LLVM that would be great.

@jrmuizel
Collaborator

jrmuizel commented Oct 6, 2017

I filed rust-lang/rust#45068 about the serialization codegen issue.

@jrmuizel
Collaborator

jrmuizel commented Oct 9, 2017

Now that #1830 has landed our serialization code is decent enough.

I've since started looking at the deserialization code. The current serde/bincode approach is going to give llvm a really rough job of generating good code. The generated rust code involves lots of functions returning the values of their fields; because rust doesn't do a good job with return-value optimization (rust-lang/rust#34861), it looks like we end up copying a lot of data around during deserialization. Once the sub-structs get big enough, llvm starts using calls to memcpy and our chances of good performance decrease further. An additional impediment to good codegen will be the branchiness caused by the fine-grained error handling.

I'm going to try to put together a stand-alone test that shows off the bad codegen and we can see what can be done on the rust side. At the same time we probably need to have a solution that doesn't require heroics from the compiler. One can imagine a serde style macro that instead of returning a value would take an &mut and fill in the fields and only return a bool on success or failure. We could pass in a SpecificDisplayItem initialized to mem::uninitialized(). This should be enough to avoid having all values being moved all over the place. @dtolnay how easy would it be to build an unsafe deserializer like this?
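Something like this, very roughly (hand-rolled, made-up item type, not actual serde machinery):

```rust
#[derive(Default)]
struct FakeTextItem {
    font: u32,
    glyph_count: u32,
    origin: [f32; 2],
}

fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let b = buf.get(*pos..*pos + 4)?;
    *pos += 4;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

// The caller owns the storage; we just fill it in and report success/failure,
// so nothing large is built up and returned by value.
fn deserialize_text_in_place(buf: &[u8], pos: &mut usize, out: &mut FakeTextItem) -> bool {
    match (read_u32(buf, pos), read_u32(buf, pos), read_u32(buf, pos), read_u32(buf, pos)) {
        (Some(font), Some(count), Some(x), Some(y)) => {
            out.font = font;
            out.glyph_count = count;
            out.origin = [f32::from_bits(x), f32::from_bits(y)];
            true
        }
        _ => false,
    }
}
```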

@glennw
Member Author

glennw commented Oct 9, 2017

@jrmuizel Thanks for following up on this stuff. I tend to agree with you in regards to the compiler heroics.

From talking to @gankro, the safety guarantees we get from using serde/bincode are that we know that enums / bools are valid values (although they could be modified to a different valid enum value, without detection) and perhaps some extra level of bounds checking. There's nothing that gets validated for ints / floats, since any bit pattern in those is valid. There are, of course, other benefits we get too (such as compression).

The other extreme from using serde / bincode seems to be a straight copy / transmute of the DL, and we manually write (unsafe) validation helpers for each struct type in the DL that check the validity of each field type. I think we could reasonably do this validation while we walk the DL during the flatten stage. It should be fairly easy to make this close to optimal in terms of performance while allowing validation?
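As a sketch of what one of those helpers might look like (made-up item type and tag count):

```rust
// Only fields with invalid bit patterns need checking; ints/floats don't.
#[repr(C)]
struct RawBorderItem {
    style_tag: u32,   // must name a real border style
    do_aa: u8,        // must be 0 or 1 to be a valid bool
    _pad: [u8; 3],
    widths: [f32; 4], // any bit pattern is fine, no check needed
}

const BORDER_STYLE_COUNT: u32 = 3; // hypothetical: Solid, Dotted, Dashed

fn validate_border_item(item: &RawBorderItem) -> bool {
    item.style_tag < BORDER_STYLE_COUNT && item.do_aa <= 1
}
```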

I know it's far from ideal (having to manually audit the validation code, and keep it up to date), but it seems like it could work, as a worst case, if we can't sort out these codegen issues?

@dtolnay

dtolnay commented Oct 9, 2017

One can imagine a serde style macro that instead of returning a value would take an &mut and fill in the fields and only return a bool on success or failure.

Serde is interested in supporting this style -- tracked in serde-rs/serde#855. We had been looking at it for the csv crate, not to avoid memcpy of the return values but to reuse memory allocations across rows deserialized from the same csv file. The return value is Result<(), D::Error> which for Bincode is 8 bytes so it should be just as efficient as a bool. Does this seem like it would fit the bill?

@Gankra
Contributor

Gankra commented Oct 9, 2017 via email

@jrmuizel
Collaborator

jrmuizel commented Oct 9, 2017

I've filed bincode-org/bincode#206 to start investigating bincode deserialization performance.

@nical
Contributor

nical commented Oct 9, 2017

[...] and we manually write (unsafe) validation helpers for each struct type in the DL that check the validity of each field type.

I have zero confidence in our ability to manually maintain the safety of this over the years to come. Let's try everything else that we can first.

@jrmuizel
Collaborator

So both serialization and deserialization no longer have atrocious code. However, looking at profiles and generated code, it looks like there's still room for improvement. I've filed https://bugs.llvm.org/show_bug.cgi?id=34911 to help reduce some of the length checks.

@mrobinson
Member

Splitting out the ClipScrollNodes to their own array makes a lot of sense to me. That means that the entire ClipScrollTree could be built in one pass and then the rest of the display list later.

@glennw
Member Author

glennw commented Dec 14, 2017

@gankro @jrmuizel I think this could probably be closed now, in favor of opening specific issues for anything else that pops up here?

@glennw
Member Author

glennw commented Dec 19, 2017

I think this can be closed now, in favor of more specific issues if there is remaining work. Please re-open if you think otherwise.

@glennw glennw closed this as completed Dec 19, 2017