Inefficient x64 codegen for swizzle #93
See this comment for a relatively efficient way to compile it for SSE. You just need an unsigned saturated add of 112 to set the MSB on out-of-range lanes.
Cool, I guess that answers my question; thanks!
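For illustration, the saturating-add trick can be checked with a scalar model (a Python sketch, not actual codegen; `pshufb_lane` models the documented per-lane PSHUFB behavior and `addusb` models PADDUSB, with helper names chosen here for clarity):

```python
def pshufb_lane(src, idx):
    """Model of PSHUFB per-lane behavior: zero the lane if the MSB of the
    index byte is set, otherwise select using the low 4 bits."""
    if idx & 0x80:
        return 0
    return src[idx & 0x0F]

def swizzle_lane(src, idx):
    """Spec semantics for v128.swizzle: any out-of-range index yields 0."""
    return src[idx] if idx < 16 else 0

def addusb(a, b):
    """Unsigned saturating byte add (PADDUSB per lane)."""
    return min(a + b, 255)

src = list(range(100, 116))  # an arbitrary 16-byte vector
for idx in range(256):
    # Adding 112 with unsigned saturation pushes every index >= 16 into
    # 128..255, i.e. sets the MSB, so PSHUFB zeroes that lane; indices
    # 0..15 map to 112..127, MSB clear, low 4 bits unchanged.
    assert pshufb_lane(src, addusb(idx, 112)) == swizzle_lane(src, idx)
```

The saturation matters: a plain wrapping add would bring large indices back into the in-range-looking region, while the saturating add keeps the MSB set for every out-of-range input.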
@abrown @AndrewScheidecker Hi guys ... I am following some SIMD implementation work and am now looking at the SIMD spec more closely. Looking at this thread and then seeing the conversation in #91, on the surface it seems the definition of instructions like this may be diluting the capabilities of x86? @abrown, how many more instructions will this take? The thread pointed to says two.
I used three:
@penzn, after we discussed a version to do some shifts to get the
In any case, the "zero out-of-range indices" requirement seems to cause a rather disappointing lowering for x86.
Yeah, you are right, what I had in mind would not work, since there is no packed-byte shift instruction... There are "VPBROADCAST[BWDQ]" instructions in AVX2, which can broadcast a value from a general-purpose register to all lanes; you can avoid doing an extra shuffle if that is available.
@abrown @jlb6740 swizzle will need multiple instructions using SSE4 (it can be implemented as a special case of shuffle). This is a very useful operation for application developers in expressing matrix operations like multiply, inverse, etc. The extra cost of the additional instructions for swizzle on IA is anticipated to be amortized away in applications and kernels. It would be very useful if we could verify this assumption using your implementation. The current instruction set for SIMD is being iterated over, and the above data point will help to make the right choices to ensure expressiveness and portable performance for wasm simd applications.
Note that as described in #68 (comment), the out-of-range behavior is substantially different between different architectures; at the time of the variable-index shuffle proposal, the current spec was believed to be as optimal as the next best alternative, which is to specify that only the lower 4 bits affect the result, because that alternative would still require masking the extra bits off on x86. The limitations on loading constants seem unfortunate. With RIP-relative addressing, the codegen for a shuffle as declared by the spec is trivial (https://gcc.godbolt.org/z/gY5qE8).
It seems like any specification of shuffle that doesn't try to be 100% compatible with x64 behavior (which would penalize other platforms) would require some form of vector constant load or materialization, which invariably will be expensive.
@penzn Not sure - I actually wanted to try porting my vertex codec that relies on pshufb to wasm, but the issue is that variable shuffles are currently not exposed in Emscripten's intrinsic header at all, so it's hard to experiment with this.
I expect to get the toolchain up to date with the latest spec proposal this week, so you should be able to experiment soon. |
@zeux I've looked at
The most important thing for the decoder is to have the shuffle in the first place; the out-of-range behavior that got specified represents, as far as I am aware, a reasonable compromise between different architectures (C would have made the behavior unspecified, but I don't think wasm can...). Having said that, the vertex codec does save a cycle assuming out-of-range shuffles return zero.

The way this works is that the bitmask for the shuffle picks a table entry that effectively requires reading a few unpacked bytes from the input stream and mixing them with packed 1/2/4-bit integers. To do this, a shuffle is used to get a vector where positions occupied by packed bytes get 0 (using out-of-range indices), and positions occupied by unpacked bytes contain the final value. This can then be combined with the vector that contains just the packed bytes with the other slots masked off - instead of something like blendv or an andnot/and/or combo this can just use or. However, if the out-of-range behavior were different I would just spend an extra instruction to mask the bytes.
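The or-based merge described here can be sketched in scalar form (hypothetical 8-lane data, not taken from meshoptimizer; `swizzle` models the spec's zero-for-out-of-range semantics):

```python
def swizzle(table, indices):
    # Spec semantics: any out-of-range index yields 0 in that lane.
    return [table[i] if 0 <= i < 16 else 0 for i in indices]

# Hypothetical slice: lanes 0 and 2 come from unpacked input bytes,
# the remaining lanes hold already-decoded packed values.
unpacked_input = [0xAA, 0xBB] + [0] * 14
indices = [0, 255, 1, 255, 255, 255, 255, 255]  # 255 = out of range -> lane is 0
from_unpacked = swizzle(unpacked_input, indices)
packed = [0, 0x11, 0, 0x22, 0x33, 0x44, 0x55, 0x66]  # zeros where unpacked bytes go
# Because the out-of-range lanes are already zero, a plain OR merges the two
# streams; no blendv or andnot/and/or sequence is needed.
merged = [a | b for a, b in zip(from_unpacked, packed)]
assert merged == [0xAA, 0x11, 0xBB, 0x22, 0x33, 0x44, 0x55, 0x66]
```

If out-of-range lanes did not return zero, the `from_unpacked` vector would need an extra masking step before the or, which is the extra instruction mentioned above.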
In terms of getting the codec to run on wasm, I was hoping to reuse the existing decoding scheme. The big missing piece for this is movemask, which is used to obtain the index into the LUT, so I was planning to emulate this part with a few scalar instructions based on the original 2/4-bit value, which probably isn't too bad.
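What movemask computes can be modeled per lane (a scalar sketch; the function mirrors the PMOVMSKB convention of packing the sign bit of each byte lane, lane 0 into bit 0):

```python
def movemask_u8x16(v):
    """Collect the most significant bit of each of 16 byte lanes into a
    16-bit mask, lane 0 in bit 0 (the PMOVMSKB convention)."""
    mask = 0
    for i, b in enumerate(v):
        mask |= ((b >> 7) & 1) << i
    return mask

# Even lanes have the high bit set, odd lanes do not.
v = [0x80 if i % 2 == 0 else 0x00 for i in range(16)]
assert movemask_u8x16(v) == 0b0101010101010101
```

Emulating this with scalar code, as described above, amounts to computing the same bit-packing from values that are already available outside the vector registers.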
@zeux, thanks for the answers re: meshoptimizer; a couple of additional questions:

1. So it sounds to me like meshoptimizer would be OK with either "most significant index bit zeroes the lane" or "any out of range index zeroes the lane"--correct? Just as long as you have some lane-zeroing behavior?
2. How useful would it be to have a WASM equivalent of

Others on the thread (@AndrewScheidecker, @arunetm, @penzn, @tlively): do we know of any other workloads that would take advantage of the "any out of range index zeroes the lane" behavior?
Yes - correct. Any out-of-range zeroing behavior would be fine. When we initially discussed the currently-specified behavior (any out of range indices produce 0), I thought about specifying the Intel behavior instead - but that would lead to extra instructions on ARM (to mask off bits 6-4) which didn't seem like a great tradeoff - however if we decide that this is a better balance I'm all for it. You can refer to the cross-architecture behavior analysis here: #68 (comment)
It would be very useful. For what it's worth there are other SIMD applications that need a function like this - for example, fast string scanning described in https://zeux.io/2019/04/20/qgrep-internals/ relies on it, and some other fast SIMD methods (e.g. simdjson, HyperScan) need it as well. The only caveat is that this function needs emulation on other architectures - I'm not sure what the stance is on instructions like this. meshoptimizer has to implement emulation for NEON using horizontal instructions, but currently WASM doesn't have any horizontal instructions, which makes it painful.
Oh also worth noting is that in the linked issue (#68), @Maratyszcza brought up the fact that if you have out-of-range-as-0 behavior, you can emulate lookup tables that are larger than 16 elements using basic arithmetic.
If the zeroing is controlled by the high bit, that approach will not work and would need something more complex.
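A scalar sketch of how such a 32-entry lookup can be built from zero-for-out-of-range swizzles (a hedged model; helper names and data are illustrative, not taken from the thread):

```python
def swizzle(table, indices):
    """v128.swizzle semantics: any out-of-range index produces 0."""
    return [table[i] if 0 <= i < 16 else 0 for i in indices]

def sub_u8(indices, k):
    """Wrapping 8-bit subtract applied per lane."""
    return [(i - k) & 0xFF for i in indices]

def lookup32(lo, hi, indices):
    # Indices 0..15 hit `lo` (the second swizzle sees 240..255 and zeroes);
    # indices 16..31 hit `hi` (the first swizzle zeroes, the second sees 0..15).
    a = swizzle(lo, indices)
    b = swizzle(hi, sub_u8(indices, 16))
    return [x | y for x, y in zip(a, b)]

lo = list(range(0, 16))     # table entries 0..15
hi = list(range(100, 116))  # table entries 16..31
assert lookup32(lo, hi, [0, 15, 16, 31] + [0] * 12)[:4] == [0, 15, 100, 115]
```

With high-bit-controlled zeroing the wrapped indices 240..255 would still select lanes via their low bits instead of producing zero, which is why that variant needs extra work.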
Yes, initially I thought that this was how you were using the out-of-range behavior.
meshoptimizer doesn’t rely on this per se - I don’t know of applications that require >16-wide table lookups OTOH, maybe @Maratyszcza can comment.
On a somewhat related note, I've finished the implementation of the SIMD vertex codec for WASM. It works on Chrome Canary. I haven't looked at codegen but the performance is pretty good. The source code is here: zeux/meshoptimizer#72, and I've also uploaded a test scene here: http://zeux.io/tests/cabin. The decoding timings are printed to the console (the function is invoked twice, for position & color streams). The performance gains without the v128.swizzle path are much less interesting so this is great. The movemask emulation is not so great, but it's hard for me to test the impact of a "proper" movemask instruction.
It's compounded by the fact that V8 doesn't know that the vector for swizzling is constant.
I don't think we have precedent for making both an immediate-argument and a stack-argument version of an instruction. It's not really written down anywhere, but I think we generally expect engines to recognize constant arguments and optimize if possible (but not necessarily do any more complex instruction selection than that).
What's the point of having an immediate version of shuffle then? How is that better than just having a shuffle that works just like swizzle?
My guess is that such an instruction would not be portable, but cc @ngzhian who knows more than me about operations available in various hardware.
I'm referring to the part where swizzle only takes two v128s while shuffle takes two v128s and one immediate index array. It could be three v128s.
swizzle was standardized after shuffle. When standardizing shuffle, there was an option of limiting the instruction to only accept in-range lane indices. As the instruction already has a complex mapping I'd imagine this was the more straightforward way to specify it. It looks like there was an old issue that considered whether it's valuable to allow out-of-range indices for that (#11) but I don't think it got traction.

When standardizing swizzle, this option didn't exist - as the input is determined at runtime, we must assign semantics to out-of-range values; these semantics were picked to be "output zero for out-of-bounds inputs" as behavior that was consistent with implementations on some ISAs (ARM) and reasonably cheap to emulate on others (x64). Multiple different variants were considered, but the one that made it into the spec was the minimal version that's sufficient to implement a wide range of algorithms. A two-source version was possible but would have required more complex codegen on architectures like x64, and so I believe we decided that we don't need it yet (it's equivalent to the 32-byte table lookup that was discussed in #24).

If a constant-index shuffle were to be extended with zeroing capability, this would probably belong outside the scope of this specific issue, and would require use cases and documented architecture mappings - for example, if lowering it almost always requires extra instructions compared to the lowering of shuffles today, then it could be implemented in user space.
Thanks for that detailed recap, @zeux!
I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.
In general, runtime mask generation is the exception, not the norm. If you can get a fixed number of permutations for your data set, you'll never mess with runtime masks.
Compile-time masks are already handled by the existing shuffle support. If you feel like the shuffle options should include handling out-of-range values, then it should be proposed as a separate issue with specific use cases and proposed changes to lowering for common architectures.
Except those compile-time masks don't allow zeroing of a lane's value, unless I'm missing something. I'll file a ticket if need be, but it seems that this logic should be part of swizzle too, since shuffle is cross-vector and swizzle is in place.
@zeux I'll file a new proposal for zeroing shuffle.
I don't think there is anything more to this bug.
Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that it would take me more than one instruction to implement `v128.swizzle` on x86. I had assumed, like @stoklund in #11, that I would be able to use `PSHUFB` as-is. However, I am now convinced that the assumptions of #11 may be incorrect.

According to the Intel manual (and some experiments I ran), `PSHUFB` uses the four least significant bits to decide which lane to grab from a vector. If the most significant bit is one (e.g. `0b10000000`), then the result is zeroed. But index values between `0x0f` and `0x80` will use the four least significant bits as an index and will not zero the value. To correctly implement the spec as it currently reads we would need to copy the swizzle mask to another register, do a greater-than comparison to get a bit in the most significant position, and `OR` this with the original swizzle mask before using the `PSHUFB` instruction--four instructions instead of one.

Should `v128.swizzle` change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?
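The four-instruction lowering described here can be sanity-checked with a scalar model (a Python sketch, assuming the greater-than comparison is the signed byte compare PCMPGTB; helper names are illustrative):

```python
def pshufb_lane(src, idx):
    """PSHUFB per lane: zero when the index MSB is set, else select via low 4 bits."""
    return 0 if idx & 0x80 else src[idx & 0x0F]

def to_i8(b):
    """Reinterpret a byte as a signed 8-bit value (how PCMPGTB sees it)."""
    return b - 256 if b >= 128 else b

def lowered_swizzle_lane(src, idx):
    # Model of the lowering: a signed compare idx > 15 (PCMPGTB) produces an
    # all-ones byte for indices 16..127; OR-ing it into the index sets the
    # MSB, so PSHUFB zeroes the lane. Indices >= 128 are negative as signed
    # bytes, so the compare yields 0, but their MSB is already set.
    mask = 0xFF if to_i8(idx) > 15 else 0x00
    return pshufb_lane(src, idx | mask)

src = list(range(16))
for idx in range(256):
    expected = src[idx] if idx < 16 else 0  # v128.swizzle spec semantics
    assert lowered_swizzle_lane(src, idx) == expected
```

Note the signed compare handles both out-of-range regions: 16..127 get the MSB forced on by the mask, and 128..255 already carry it, which is why no unsigned compare is needed in this sequence.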