
Packed lane indices #69

Closed
penzn opened this issue Mar 7, 2019 · 13 comments
Labels
needs discussion Proposal with an unclear resolution


@penzn
Contributor

penzn commented Mar 7, 2019

#30 proposes to reduce the size of the immediate shuffle mask from 16 bytes to 10 by using the minimum number of bits necessary to index lanes: each index selects one of 32 bytes across the two input vectors, so 5 bits per index suffice, and 16 × 5 = 80 bits = 10 bytes. This would reduce the size of WASM binaries, but may come with a performance penalty for unpacking. The discussion thread mentioned above demonstrated that the reduction in encoded instruction size would be pretty significant, but we probably need to assess the cost for runtimes.
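For concreteness, here is a minimal sketch of how such a mask could be packed (the function name and the low-bits-first byte layout are assumptions for illustration; the actual bit order would have to be fixed by the spec):

#include <stdint.h>

/* Illustrative sketch: pack 16 five-bit lane indices into 10 bytes
 * (16 * 5 = 80 bits), filling each byte from its low bits first. */
void pack(const uint8_t indices[16], uint8_t packed[10]) {
  for (int i = 0; i < 10; i++)
    packed[i] = 0;
  for (int i = 0; i < 16; i++) {
    int bit = i * 5; /* starting bit position of index i */
    packed[bit / 8] |= (uint8_t)((indices[i] & 0x1F) << (bit % 8));
    if (bit % 8 > 3) /* the 5-bit value straddles a byte boundary */
      packed[bit / 8 + 1] |= (uint8_t)((indices[i] & 0x1F) >> (8 - bit % 8));
  }
}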

@gnzlbg
Contributor

gnzlbg commented Mar 7, 2019

A WASM binary would need to contain a sizable number of vector shuffles for the cost of the packing/unpacking to become relevant, and for the savings of this optimization to be worth it.

IMO the more important issue to resolve here is whether WASM immediates should, in general, be packed as tightly as possible, or whether we don't care. Leaving the immediates of any single instruction unpacked will probably have no effect, but if one starts wasting space in every instruction with immediates, it all adds up sooner or later.

@penzn
Contributor Author

penzn commented Mar 7, 2019

I went and modified the unpacker example from #30:

$ cat unpacker.h
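/* 16 five-bit lane indices packed into 16 * 5 = 80 bits = 10 bytes. */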
struct packed5x16 {
  char i0: 5;
  char i1: 5;
  char i2: 5;
  char i3: 5;
  char i4: 5;
  char i5: 5;
  char i6: 5;
  char i7: 5;
  char i8: 5;
  char i9: 5;
  char i10: 5;
  char i11: 5;
  char i12: 5;
  char i13: 5;
  char i14: 5;
  char i15: 5;
};

void unpack(char *packed, char *unpacked);
$ cat unpacker.c
#include "unpacker.h"

void unpack(char *packed, char *unpacked) {
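  /* Reinterpret the 10 packed bytes as bit fields; each read below widens a 5-bit index to a byte. */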
  struct packed5x16 const * unpacker = (struct packed5x16 const*) packed;
  *(unpacked++) = unpacker->i0;
  *(unpacked++) = unpacker->i1;
  *(unpacked++) = unpacker->i2;
  *(unpacked++) = unpacker->i3;
  *(unpacked++) = unpacker->i4;
  *(unpacked++) = unpacker->i5;
  *(unpacked++) = unpacker->i6;
  *(unpacked++) = unpacker->i7;
  *(unpacked++) = unpacker->i8;
  *(unpacked++) = unpacker->i9;
  *(unpacked++) = unpacker->i10;
  *(unpacked++) = unpacker->i11;
  *(unpacked++) = unpacker->i12;
  *(unpacked++) = unpacker->i13;
  *(unpacked++) = unpacker->i14;
  *(unpacked++) = unpacker->i15;
}
$ clang -O3 -S -o - unpacker.c | cat -n
...
     6  unpack:                                 # @unpack
     7          .cfi_startproc
     8  # %bb.0:                                # %entry
     9          movb    (%rdi), %al
    10          shlb    $3, %al
...
    72          movb    %al, 15(%rsi)
    73          retq
    74  .Lfunc_end0:
    75          .size   unpack, .Lfunc_end0-unpack
    76          .cfi_endproc
...

So that is 60+ instructions at -O3.

@tlively
Member

tlively commented Mar 7, 2019

Is it really that onerous for engines to unpack 16 5-bit values every once in a while? I would expect an engine to either be memory or network bound, in which case packing the values is helpful, or CPU bound because it is doing enough work that the extra time to unpack the values would be negligible.

The only argument against packing the immediates that I find convincing, in the absence of data to the contrary, is that the benefits are not worth the extra complexity in the spec, but that's not a very compelling argument either.

@binji
Member

binji commented Mar 7, 2019

In general we haven't tried too hard to minimize the size of uncompressed wasm. For example, every load/store has an extra two bytes at least for the alignment and offset. The cost of 6 extra bytes for a rare instruction like shuffle is surely negligible.

I'd prefer to keep the specification simple, but I don't think it's just a spec concern. For example, it is very tempting to write code as seen above when handling packed bits, but neither C nor C++ specify enough to use them this way: see https://en.cppreference.com/w/c/language/bit_field and https://en.cppreference.com/w/cpp/language/bit_field. Of course we can write code to pack and unpack the bits manually, but I'm not certain it will be worth the effort.
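A portable shift-and-mask unpacker sidesteps the bit-field issue; a minimal sketch, assuming the same low-bits-first layout that the bit-field struct produces on common compilers (the function name is illustrative):

#include <stdint.h>

/* Illustrative sketch: unpack 16 five-bit lane indices from 10 bytes
 * using explicit shifts and masks instead of bit fields. Assumes each
 * byte is filled from its low bits first. */
void unpack_portable(const uint8_t packed[10], uint8_t unpacked[16]) {
  for (int i = 0; i < 16; i++) {
    int bit = i * 5; /* starting bit position of index i */
    /* Read two adjacent bytes so straddling values are handled uniformly;
     * the last index ends exactly at byte 9, so guard the second read. */
    unsigned word = packed[bit / 8];
    if (bit / 8 + 1 < 10)
      word |= (unsigned)packed[bit / 8 + 1] << 8;
    unpacked[i] = (uint8_t)((word >> (bit % 8)) & 0x1F);
  }
}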

@gnzlbg
Contributor

gnzlbg commented Mar 7, 2019

Of course we can write code to pack and unpack the bits manually, but I'm not certain it will be worth the effort.

At least for WABT the code is already written :P

@penzn
Contributor Author

penzn commented Mar 7, 2019

In general we haven't tried too hard to minimize the size of uncompressed wasm. For example, every load/store has an extra two bytes at least for the alignment and offset. The cost of 6 extra bytes for a rare instruction like shuffle is surely negligible. I'd prefer to keep the specification simple, but I don't think it's just a spec concern.

That is a valid point; this optimization would work better if done in sync with the rest of the WASM spec (which would make it post-MVP).

For example, it is very tempting to write code as seen above when handling packed bits, but neither C nor C++ specify enough to use them this way: see https://en.cppreference.com/w/c/language/bit_field and https://en.cppreference.com/w/cpp/language/bit_field. Of course we can write code to pack and unpack the bits manually, but I'm not certain it will be worth the effort.

True. I just took the example as it was written in one of the PR comments; I am actually not sure it works -- my bad. Yes, the semantics of bit fields are tricky, and a portable solution would probably be more complex, which would also be true for optimizing packing in the rest of the WASM spec.

@lemaitre

lemaitre commented Mar 8, 2019

First, I would like to say that the unpack implementation with bit fields is just a quick and dirty way to do it.
It will work on major compilers, as they don't add any padding in these cases (packing can be forced with __attribute__((packed))).
It was just meant to show that it is not complex to implement.

That being said, I don't think packing immediates really adds complexity to the spec (it really just says "immediates are packed, meaning there are no padding bits between them").
It does add complexity in the virtual machine, but that can be kept pretty low if my quick and dirty implementation is used.

However, it does slow down the translation of a wasm program.
That step is done only once, so being a bit slower there is fine (especially when the bottleneck of that step will be networking).

60 instructions looks huge, but in fact it will still be much faster than a single L3 cache miss (on the order of 400 cycles).
So I think the cost is negligible.

Should this be delayed to post-MVP?

I really think it should not, for a very simple reason: this encoding change would break binary compatibility.

@tlively
Member

tlively commented Mar 8, 2019

Does anyone have data for how frequent shuffles are in a real-world project compiled with wasm SIMD? My intuition is that packing the immediates is only worth it if it can lead to a code size savings of at least a couple percent. Perhaps @kripken wants to chime in on what threshold he would use, since I know he cares a lot about code size.

@binji
Member

binji commented Mar 8, 2019

@lemaitre Sorry, not trying to call anybody out about the C code. My point was just that bit-packing does add complexity, in specification and in implementation.

@tlively For a 2 MiB wasm file, a 1% savings would mean ~20,000 bytes. At 6 bytes saved per instruction, we'd need to see ~3500 shuffles. Just for comparison, the BananaBread wasm file, which is ~2 MiB, has the following opcode counts:

...
return: 5095
f32.sub: 4340
drop: 4260
i32.load8_s: 4190
i32.store8: 4142
i32.eq: 3588
i32.ne: 3326
i32.sub: 3142
i32.or: 2938
i64.load: 2935
i64.const: 2581
...

I'd be pretty surprised to see as many shuffles as common operations like i32.eq.

@gnzlbg
Contributor

gnzlbg commented Mar 8, 2019

I'd be pretty surprised to see as many shuffles as common operations like i32.eq

Me too. The WASM module would need to use SIMD everywhere, with a lot of shuffles. In my experience, only the hottest code is worth vectorizing.

@tlively
Member

tlively commented Mar 28, 2020

Since no one has pushed on this in over a year, I suggest we close this as not worth the (small) extra complexity.

@tlively tlively added the needs discussion Proposal with an unclear resolution label Mar 28, 2020
@zeux
Contributor

zeux commented Apr 3, 2020

Also worth noting: I'd expect applications to use a small number of unique shuffle masks, which means an LZ compressor can encode them efficiently, and I'd expect that to be the primary delivery method for size-constrained applications (using deflate transport). So this doesn't seem to be a big deal.

@dtig
Member

dtig commented Jan 8, 2021

Closing this issue as it is no longer active.

@dtig dtig closed this as completed Jan 8, 2021