Layer 1 encoding proposal #1180
Very cool! I don't think round-tripping needs to be a goal here, especially for padding purposes. It's easier for most tools to use extra padding on varints, but I don't see why a compressor would want to keep that. Are there other things that get modified in a round-trip?

There are plenty of things we could reorder, but IIUC your tool doesn't have sufficient semantic understanding to do this, right?

When you split things into segments, where do those boundaries fall? Would it make sense to always have them fall at function boundaries, and pack functions together up to a certain size? My intuition is that the decoded output wants to be handed off to the compiler, so having breaks at function boundaries helps keep the pipeline simple. Could we even reorder functions if we wanted to help parallel compilation?

My guess is that the function definitions, followed by the data sections, are the biggest things. Does that match what you see? It seems like each section type probably wants its own kind of compression, and the most format-aware stuff will come from the function definitions.

Where do people think the best place to discuss this would be? A short intro in a CG video call seems like a good start, and might help get compression-interested people on board with the feature.
Right, round-tripping is mostly for testing purposes, since it lets me verify that everything is passing through correctly. It also makes sense to have a way to down-convert from layer 1 to layer 0 when working with older layer-0-only software, I think, so having a standalone implementation makes sense. I assume eventually you'd just handle all this in the relevant parts of the existing tooling. From what I saw, round-trips were mostly modifying the encoding of varints, since there are multiple ways to represent them and binaryen seems to reserve extra space. I did find the UE4 demo loses around 400kb of stray bytes, but I'm not sure which section they're from - probably a custom one, since it's not easy to verify that I implemented custom sections correctly.
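The varint differences come from LEB128 allowing the same value to be encoded at several widths: a short encoding can be padded with continuation bytes that contribute zero value bits, which is presumably what tools like binaryen do when they reserve space. A quick Python sketch (not part of the prototype) showing two valid encodings of the same value:

```python
def encode_uleb128(value, min_bytes=1):
    """Encode an unsigned int as LEB128, optionally padded to min_bytes."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value or len(out) + 1 < min_bytes:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def decode_uleb128(data):
    """Decode a LEB128 value, padded or not."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    return result

# The canonical 1-byte encoding and a padded 3-byte encoding both decode to 5.
assert encode_uleb128(5) == b"\x05"
assert encode_uleb128(5, min_bytes=3) == b"\x85\x80\x00"
assert decode_uleb128(b"\x85\x80\x00") == 5
```

A round-tripping tool that re-encodes canonically will therefore shrink such modules without changing their meaning.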
The prototype has basic semantic understanding of wasm (definitions of all the opcodes and their operand types, etc.) but probably not enough for advanced optimizations. I did test some basic reorderings possible with this information, but they had no positive impact. It's possible intelligent reordering would let you make the most common function/type indices smaller; this was the case in some of my older compression prototypes.
For segmentation, the approach I used in the prototype is to attempt a split every 512 functions when encoding the function table. The split only happens if the current segment is over a size threshold, so you end up with a stream of reasonably-sized segments that aren't necessarily at consistent intervals. You might want to split segments for the other tables too, but the function table is the most important one. Splitting between function bodies seems like the right approach because then decoding a function never spans segments, so you can just bump pointers while reading. The segments also act as a good unit if you want to compile in parallel.
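The splitting policy described above can be sketched as a standalone function (a hypothetical Python illustration, not the prototype's code; the `min_bytes` default is made up, only the 512-function interval comes from the description):

```python
def segment_functions(function_sizes, interval=512, min_bytes=64 * 1024):
    """Group function bodies into segments. A split is considered every
    `interval` functions but only taken once the current segment has
    accumulated at least `min_bytes`. Returns (start, end) index pairs."""
    segments, start, acc = [], 0, 0
    for i, size in enumerate(function_sizes):
        acc += size
        if (i + 1) % interval == 0 and acc >= min_bytes:
            segments.append((start, i + 1))
            start, acc = i + 1, 0
    if start < len(function_sizes):
        segments.append((start, len(function_sizes)))  # trailing remainder
    return segments

# 1024 functions of 200 bytes each: two segments at the 512-function marks.
assert segment_functions([200] * 1024) == [(0, 512), (512, 1024)]
# Tiny functions never reach the size threshold, so they stay in one segment.
assert segment_functions([10] * 600) == [(0, 600)]
```

Because every boundary falls between function bodies, each segment can be handed to a decoder (or a parallel compiler worker) independently.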
The code and data sections are the biggest parts of the test modules I used, yeah - with the exception of the UE4 test, which for some reason was uploaded to the net with a 9mb names section. Not sure why they didn't strip that.
You can definitely come up with a unique approach to encoding and compressing each section. For most of the sections I just broke the table entries down into primitive types, which produces a few streams that compress a bit better:

```csharp
public struct export_entry {
    public string field;
    public external_kind kind;
    public uint index;
}

// ...

public export_entryEncoder (ModuleEncoder moduleEncoder) : base(moduleEncoder) {
    fields = GetStream("field");
    kinds = GetStream("external_kind");
    indices = GetStream("table_index");
}

public override void Encode (ref export_entry value) {
    Builder.Write(value.field, fields);
    kinds.Write((byte)value.kind);
    Builder.Write(value.index, indices);
}
```

In many cases you could compress those primitive streams further - you could potentially bit-pack them or apply Huffman coding. The code section is where most of the format-aware code lives, since the opcode determines how many values are read and what their types are. The opcode/expression-specific code is about 1k lines of C# for the full parse/encode/decode implementation.
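As an example of the bit-packing idea: a stream of `external_kind` values only needs a couple of bits per entry, so four entries could share a byte. A hypothetical sketch (in Python rather than the prototype's C#):

```python
def bitpack2(values):
    """Pack values in the range 0..3 four to a byte, low bits first."""
    out = bytearray()
    for i in range(0, len(values), 4):
        b = 0
        for j, v in enumerate(values[i:i + 4]):
            b |= (v & 0x3) << (2 * j)
        out.append(b)
    return bytes(out)

def bitunpack2(data, count):
    """Recover `count` 2-bit values from a packed byte stream."""
    return [(data[i // 4] >> (2 * (i % 4))) & 0x3 for i in range(count)]

# 0b11_10_01_00 == 0xE4: four kinds in one byte instead of four.
assert bitpack2([0, 1, 2, 3]) == bytes([0xE4])
assert bitunpack2(bitpack2([0, 1, 2, 3, 1]), 5) == [0, 1, 2, 3, 1]
```

The stream count (the number of entries) has to be stored elsewhere, since the packed form can't distinguish padding from trailing zero values.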
I have a loose proposal for a layer 1 encoding. The basic idea is to split the current sequential stream-of-structs encoding into a set of smaller data streams - integer constants, function indices, opcodes, etc. Doing this dramatically increases the compression ratio when compressing with weaker codecs like gzip (and yields a slight size reduction even for stronger codecs like brotli). This also creates opportunities for parallel decoding and better icache/dcache occupancy while decoding (in particular for varints).
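To make the stream-splitting idea concrete, here's a small Python sketch (not the actual prototype; the record layout and field names are invented for illustration) that serializes some export-entry-like records two ways - interleaved, as layer 0 does, and redistributed into per-field streams - so you can compare how each layout compresses with zlib:

```python
import zlib

def uleb128(value):
    """Minimal unsigned LEB128 encoder."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Hypothetical export entries: (name, kind, index).
entries = [(f"func_{i}".encode(), 0, i) for i in range(1000)]

# Layer-0 style: one interleaved stream of structs.
interleaved = b"".join(
    uleb128(len(name)) + name + bytes([kind]) + uleb128(index)
    for name, kind, index in entries
)

# Layer-1 style: the same bytes, redistributed into homogeneous streams.
names = b"".join(uleb128(len(n)) + n for n, _, _ in entries)
kinds = bytes(k for _, k, _ in entries)
indices = b"".join(uleb128(i) for _, _, i in entries)
split = names + kinds + indices

assert len(split) == len(interleaved)  # same content, only reordered
print("interleaved:", len(zlib.compress(interleaved, 9)),
      "split:", len(zlib.compress(split, 9)))
```

The homogeneous streams give the codec longer runs of similar bytes (e.g. the kinds stream here is a single run), which is where the gzip-level gains come from; exact numbers depend on the data, so it's worth measuring per module.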
I spent a couple of days prototyping this a while ago and got it to the point where it can round-trip the Tanks Unity demo and one of the UE4 demos (with the minor caveat that you can't fully round-trip wasm modules byte-for-byte without using the same encoder, because encoders like binaryen seem to insert filler bytes and do other unusual things to simplify the encoding process).
In my simple prototype (1.5k lines of C# to decode & encode wasm) it takes around 2 seconds to convert the UE4 demo (38.5mb) to my proposed format, and 1.4 seconds to convert it back to the .wasm format.
I believe my proposed format is compatible with streaming compilation and can trivially be integrated into existing decoders - you essentially just maintain separate read pointers for each data type, and when reading from layer 0 files all of those pointers are the same (and increment in sync). This works because the overall ordering of the data is not altered; the data is just sliced up and redistributed. Because the format is compatible with streaming, you can also layer gzip or brotli transport compression on top of it without having to buffer everything into memory before decompressing or compiling.
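The separate-read-pointers idea might look something like this (a hypothetical Python sketch, not the prototype; the stream names and layout are invented). In layer-0 mode every stream shares a single cursor, so reads advance in sync through the interleaved bytes; in layer-1 mode each stream gets its own cursor starting at that stream's offset:

```python
class Cursor:
    """A mutable read position, shareable between streams."""
    def __init__(self, pos=0):
        self.pos = pos

class LayeredReader:
    def __init__(self, data, stream_offsets=None, streams=("opcode", "index")):
        self.data = data
        if stream_offsets is None:
            shared = Cursor()  # layer 0: one cursor for everything
            self.cursors = {name: shared for name in streams}
        else:                  # layer 1: one cursor per stream
            self.cursors = {name: Cursor(off)
                            for name, off in stream_offsets.items()}

    def read_byte(self, stream):
        cursor = self.cursors[stream]
        byte = self.data[cursor.pos]
        cursor.pos += 1
        return byte

# Layer 0: opcode and index bytes interleaved, read with one shared cursor.
r0 = LayeredReader(bytes([0x10, 0x02]))
assert (r0.read_byte("opcode"), r0.read_byte("index")) == (0x10, 0x02)

# Layer 1: the opcodes [0x10, 0x10] then the indices [0x02, 0x03].
r1 = LayeredReader(bytes([0x10, 0x10, 0x02, 0x03]),
                   {"opcode": 0, "index": 2})
assert (r1.read_byte("opcode"), r1.read_byte("index")) == (0x10, 0x02)
assert (r1.read_byte("opcode"), r1.read_byte("index")) == (0x10, 0x03)
```

The decode logic itself is identical in both modes; only the cursor wiring changes, which is what makes retrofitting an existing decoder plausible.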
The current proposal produces respectable gains, and I think there are ways to improve the file size further - I only tried a couple dozen things. Choosing what to split and how to encode it can have a pretty significant impact - for example, splitting the 'flags' and 'offset' elements of the memory immediate into their own streams makes the post-gzip file slightly larger, and I'm not totally sure why. The bright side is that it's very easy to experimentally verify whether a change improves file size. I suspect some conversations with the designers of codecs like Brotli might lead to ideas for further improvements.
I'm including a basic illustration of the proposal below along with data from my tests.