Packed lane indices #69
A WASM binary would need to contain a sizable number of vector shuffles for the cost of the packing/unpacking to become relevant, and for the savings of this optimization to be worth it. IMO the more important issue to resolve here is whether WASM immediates should, in general, be as packed as possible, or whether we don't care. Not packing the immediates for any single instruction will probably not have any effect, but if one starts wasting space for all instructions with immediates, then all of it sooner or later adds up.
I went and modified the unpacker example from #30:
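A sketch along those lines, assuming the 16 five-bit lane indices are packed contiguously, least-significant bit first, into 10 bytes (the exact code from #30 may differ):

```c
#include <stdint.h>

/* Sketch: extract 16 five-bit lane indices from a 10-byte immediate.
   Assumes the indices are packed contiguously, LSB-first. */
static void unpack_lanes(const uint8_t packed[10], uint8_t lanes[16]) {
  for (int i = 0; i < 16; i++) {
    unsigned bit = (unsigned)i * 5;   /* bit offset of index i */
    unsigned byte = bit / 8;
    unsigned shift = bit % 8;
    /* A five-bit field spans at most two bytes; read both and mask. */
    unsigned window = packed[byte];
    if (byte + 1 < 10)
      window |= (unsigned)packed[byte + 1] << 8;
    lanes[i] = (window >> shift) & 0x1F;
  }
}
```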
So that is 60+ instructions at -O3.
Is it really that onerous for engines to unpack 16 5-bit values every once in a while? I would expect an engine to either be memory or network bound, in which case packing the values is helpful, or CPU bound because it is doing enough work that the extra time to unpack the values would be negligible. The only argument against packing the immediates I find convincing in the absence of data to the contrary is that the benefits are not worth the extra complexity in the spec, but that's not a very compelling argument either.
In general we haven't tried too hard to minimize the size of uncompressed wasm. For example, every load/store has at least two extra bytes for the alignment and offset. The cost of 6 extra bytes for a rare instruction like shuffle is surely negligible. I'd prefer to keep the specification simple, but I don't think it's just a spec concern. For example, it is very tempting to write code like the above when handling packed bits, but neither C nor C++ specify enough about bit fields to use them this way: see https://en.cppreference.com/w/c/language/bit_field and https://en.cppreference.com/w/cpp/language/bit_field. Of course we can write code to pack and unpack the bits manually, but I'm not certain it will be worth the effort.
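For illustration, the tempting but non-portable shape of such code; whether these fields pack without padding, their order within a storage unit, and even whether `uint8_t` is a valid bit-field type are all implementation-defined:

```c
#include <stdint.h>

/* Tempting but non-portable: the standard does not guarantee that
   adjacent bit-fields share storage without padding, nor specify
   their order, and uint8_t as a bit-field type is itself an
   implementation extension in C. */
struct shuffle_mask {
  uint8_t lane0 : 5;
  uint8_t lane1 : 5;
  uint8_t lane2 : 5;
  /* ... lane3 through lane15 ... */
};
```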
At least for WABT the code is already written :P |
I think that is a valid point; this optimization would work better if done in sync with the rest of the WASM spec (which would be post-MVP).
True, I just took the example as it was written in one of the PR comments; I am actually not sure whether it works -- my bad. Yes, the semantics of bit fields are tricky, and a portable solution would probably be more complex, which would also be true for optimizing packing in the rest of the WASM spec.
I first want to say that the unpack implementation with bit fields is just a quick and dirty way to do it. That being said, I don't think packing immediates really adds complexity to the spec (it just says "immediates are packed, meaning there are no padding bits between them"). However, it does slow down the translation speed of a wasm program. 60 instructions looks huge, but executing them will still be much faster than a single L3 cache miss (on the order of 400 cycles). Should this be delayed to post-MVP? I really think it should not, for a very simple reason: this encoding change would break binary compatibility.
Does anyone have data on how frequent shuffles are in real-world projects compiled with wasm SIMD? My intuition is that packing the immediates is only worth it if it leads to a code size savings of at least a couple percent. Perhaps @kripken wants to chime in on what threshold he would use, since I know he cares a lot about code size.
@lemaitre Sorry, not trying to call anybody out about the C code. My point was just that bit-packing does add complexity, in both the specification and the implementation. @tlively For a 2 MiB wasm file, a 1% savings would mean ~20k bytes. At 6 bytes saved per instruction, we'd need to see ~3500 shuffles. Just for comparison, the BananaBread wasm file, which is ~2 MiB, has the following opcode counts:
I'd be pretty surprised to see as many shuffles as common operations like
Me too. The WASM module would need to be using SIMD everywhere, with a lot of shuffles. In my experience, only the hottest code is worth vectorizing.
Since no one has pushed on this in over a year, I suggest we close this as not worth the (small) extra complexity. |
Also worth noting that I'd expect applications to use a small number of unique shuffle masks, which means they can be compressed by LZ efficiently - and I'd expect that to be the primary delivery method for size-constrained applications (using deflate transport). So it doesn't seem to be a big deal. |
Closing this issue as this is no longer active. |
#30 proposes to reduce the size of the immediate shuffle mask from 16 bytes to 10, by using the minimum number of bits necessary to index lanes (each of the 16 indices addresses one of 32 lanes across the two input vectors, so 5 bits suffice: 16 × 5 = 80 bits = 10 bytes). It would reduce the size of WASM binaries, but may come with a performance penalty for unpacking. The discussion thread mentioned above demonstrated that the reduction in encoded instruction size would be pretty significant, but we probably need to assess the cost for runtimes.