
Inefficient x64 codegen for swizzle #93

Closed
abrown opened this issue Aug 12, 2019 · 37 comments

@abrown
Contributor

abrown commented Aug 12, 2019

Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that it would take me more than one instruction to implement v128.swizzle on x86. I had assumed, like @stoklund in #11, that I would be able to use PSHUFB as-is. However, I am now convinced that the assumptions of #11 may be incorrect:

Lanes with an out-of-range selector become 0 in the output vector.

According to the Intel manual (and some experiments I ran), PSHUFB uses the four least significant bits of each selector byte to decide which lane to grab from a vector. If the most significant bit is one (e.g. 0b10000000), the result is zeroed. But index values between 0x0f and 0x80 use the four least significant bits as an index and do not zero the value. To correctly implement the spec as it currently reads, we would need to copy the swizzle mask to another register, do a greater-than comparison to get a bit into the most significant position, and OR this with the original swizzle mask before using the PSHUFB instruction--four instructions instead of one.
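
For illustration, a minimal intrinsics sketch of that four-instruction sequence (function name is mine, assuming SSE2 plus SSSE3):

#include <emmintrin.h>   /* SSE2: compare, OR */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Sketch of the compare-and-OR lowering: force the MSB on any selector
   byte greater than 15 so that PSHUFB zeroes those lanes. Selectors of
   0x80 and above are negative as signed bytes, so the compare misses
   them, but their MSB is already set and PSHUFB zeroes them anyway. */
__m128i swizzle_spec(__m128i v, __m128i idx) {
    __m128i oor = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15)); /* 0xFF where idx > 15 */
    __m128i mask = _mm_or_si128(idx, oor);   /* set MSB on out-of-range lanes */
    return _mm_shuffle_epi8(v, mask);        /* PSHUFB */
}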

Should v128.swizzle change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?

@AndrewScheidecker
Contributor

See this comment for a relatively efficient way to compile it for SSE. You just need an unsigned saturated add of 112 to set the MSB on out-of-range lanes.

@abrown
Contributor Author

abrown commented Aug 12, 2019

Cool, I guess that answers my question; thanks!

@abrown abrown closed this as completed Aug 12, 2019
@jlb6740

jlb6740 commented Aug 13, 2019

See this comment for a relatively efficient way to compile it for SSE. You just need an unsigned saturated add of 112 to set the MSB on out-of-range lanes.

@abrown @AndrewScheidecker Hi guys ... I am following some SIMD implementation work and now looking at the SIMD spec more closely. Looking at this thread and then at the conversation here: #91, on the surface it seems the definition of instructions like this may be diluting the capabilities of x86? @abrown, how many more instructions will this take ... the thread pointed to says two?

@abrown
Contributor Author

abrown commented Aug 13, 2019

I used three:

  • MOVUPS to get the magic constant in place (112 in decimal, 0x70707070... in memory)
  • PADDUSB to do the saturating add
  • PSHUFB to swizzle the bytes
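
In intrinsics form, a sketch of that three-instruction sequence (assuming SSSE3; the compiler materializes the 0x70... constant behind the MOVUPS load):

#include <emmintrin.h>   /* SSE2: _mm_adds_epu8 (PADDUSB) */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Any index >= 16 saturates to >= 0x80 after adding 112, so PSHUFB
   zeroes exactly the out-of-range lanes, as the spec requires. */
__m128i swizzle_paddusb(__m128i v, __m128i idx) {
    const __m128i k112 = _mm_set1_epi8(0x70);  /* the magic constant (MOVUPS) */
    __m128i m = _mm_adds_epu8(idx, k112);      /* PADDUSB: unsigned saturating add */
    return _mm_shuffle_epi8(v, m);             /* PSHUFB */
}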

@abrown abrown reopened this Aug 13, 2019
@penzn
Contributor

penzn commented Aug 13, 2019

movups assumes that the constant lives in a static data section. That may not work for a Wasm runtime.

@abrown
Contributor Author

abrown commented Aug 13, 2019

@penzn, after we discussed a version that would use shifts to build the 0x7070... value, I looked through the Intel manual and was unable to find a byte-width shift that would work for what we described (but maybe I just missed it). Here's a five-instruction version that still uses moves, but from immediates and registers:

  • MOV 0x70707070_70707070, %rax sets up half of the magic value
  • MOVQ %rax, %xmm0 moves it into the lower eight bytes of the XMM register
  • PSHUFD to copy the lower eight bytes to the upper eight bytes
  • PADDUSB to do the saturating add
  • PSHUFB
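
The same register-only materialization in intrinsics form (a sketch, assuming x86-64 and SSSE3; no data-section load needed):

#include <emmintrin.h>   /* SSE2 */
#include <tmmintrin.h>   /* SSSE3 */

/* Build 0x70 in every lane from an immediate: MOV + MOVQ fill the low
   eight bytes, PSHUFD duplicates them into the high eight bytes. */
__m128i swizzle_no_load(__m128i v, __m128i idx) {
    __m128i k = _mm_cvtsi64_si128(0x7070707070707070LL);  /* MOV + MOVQ */
    k = _mm_shuffle_epi32(k, _MM_SHUFFLE(1, 0, 1, 0));    /* PSHUFD: copy low qword up */
    __m128i m = _mm_adds_epu8(idx, k);                    /* PADDUSB */
    return _mm_shuffle_epi8(v, m);                        /* PSHUFB */
}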

In any case, the "zero out-of-range indices" requirement seems to cause a rather disappointing lowering for x86.

@penzn
Contributor

penzn commented Aug 14, 2019

Yeah, you are right, what I had in mind would not work, since there is no packed byte shift instruction...

There are "VPBROADCAST[BWDQ]" instructions in AVX2, which can broadcast a value from general purpose register to all lanes, you can avoid doing en extra shuffle if that is available.

Also, see #70 and #63

@arunetm
Collaborator

arunetm commented Aug 14, 2019

@abrown @jlb6740 swizzle will need multiple instructions using SSE4 (it can be implemented as a special case of shuffle).

This is a very useful operation for application developers in expressing matrix operations like multiply, inverse, etc. The extra cost of the additional instructions for swizzle on IA is expected to be amortized away in applications and kernels. It would be very useful if we could verify this assumption using your implementation.

The SIMD instruction set is still being iterated on, and the above data point will help us make the right choices to ensure expressiveness and portable performance for Wasm SIMD applications.

@zeux
Contributor

zeux commented Aug 20, 2019

Note that, as described in #68 (comment), the out-of-range behavior is substantially different between architectures; at the time of the variable-index shuffle proposal, the current spec was believed to be about as optimal as the next-best alternative (specifying that only the lower 4 bits affect the result), because that alternative would require masking the extra bits off on x86.

The limitations on loading constants seem unfortunate. With RIP-relative addressing, the codegen for a shuffle with the semantics declared by the spec is, trivially (https://gcc.godbolt.org/z/gY5qE8):

shuffle(long long __vector(2), long long __vector(2)):                      # @shuffle(long long __vector(2), long long __vector(2))
        vpaddusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_0]
        vpshufb xmm0, xmm0, xmm1
        ret

It seems like any specification of shuffle that doesn't try to be 100% compatible with x64 behavior (which would penalize other platforms) will require some form of vector constant load or materialization, which will invariably be expensive.

@penzn
Contributor

penzn commented Aug 22, 2019

@zeux that matches @abrown's initial solution above with a load. It would be good to test swizzle's performance. Do you know of any code that would exercise this instruction?

@zeux
Contributor

zeux commented Sep 7, 2019

@penzn Not sure - I actually wanted to try porting my vertex codec, which relies on pshufb, to wasm, but the issue is that variable shuffles are currently not exposed in Emscripten's intrinsic header at all, so it's hard to experiment with this.

@tlively
Member

tlively commented Sep 7, 2019

I expect to get the toolchain up to date with the latest spec proposal this week, so you should be able to experiment soon.

@abrown
Contributor Author

abrown commented Sep 26, 2019

@zeux I've looked at meshoptimizer a bit and I can't see how swizzle's "zero out-of-range index" behavior would change your code. It seems to me that for the shuffle (https://github.com/abrown/meshoptimizer/blob/284728f0f717ae9eda7f3e1d41dfaaa3423b189b/src/vertexcodec.cpp#L483) you will still need a kDecodeBytesGroupShuffle lookup table for generating the shuffle mask (https://github.com/abrown/meshoptimizer/blob/284728f0f717ae9eda7f3e1d41dfaaa3423b189b/src/vertexcodec.cpp#L418), regardless of whether out-of-range lanes are marked by the top bit being 1 or simply by an index greater than 15. Am I reading that code correctly? Or would you perhaps write a different implementation entirely, without the lookup table, when compiling to Wasm SIMD? (In essence, I'm trying to understand how and why the shuffle/swizzle is used in your decoding, so any help is appreciated 😄)

@zeux
Contributor

zeux commented Sep 27, 2019

The most important thing for the decoder is to have the shuffle in the first place; the out-of-range behavior that got specified represents, as far as I am aware, a reasonable compromise between different architectures (C would have made the behavior unspecified, but I don't think wasm can...).

Having said that, the vertex codec does save a cycle by assuming out-of-range shuffles return zero. The way this works is that the bitmask for the shuffle picks a table entry that effectively requires reading a few unpacked bytes from the input stream and mixing them with packed 1/2/4-bit integers. To do this, a shuffle is used to get a vector where positions occupied by packed bytes become 0 (using out-of-range selectors) and positions occupied by unpacked bytes contain the final value. This can then be combined with the vector that contains just the packed bytes with the other slots masked off - instead of something like blendv or an andnot/and/or combo, this can just use or.
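
A simplified sketch of that combining step (illustrative only, not meshoptimizer's actual code; current wasm_simd128.h names assumed):

#include <wasm_simd128.h>

/* Lanes whose selector is out of range come back as zero from the
   swizzle, so the shuffled unpacked bytes can be merged with the
   pre-masked packed bytes using a single OR. */
v128_t merge_group(v128_t unpacked_src, v128_t sel, v128_t packed_masked) {
    v128_t unpacked = wasm_i8x16_swizzle(unpacked_src, sel); /* packed slots -> 0 */
    return wasm_v128_or(unpacked, packed_masked);  /* no blendv or andnot/and/or */
}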

However, if the out-of-range behavior were different, I would just spend an extra instruction to mask the bytes.

@zeux
Contributor

zeux commented Sep 27, 2019

In terms of getting the codec to run on wasm, I was hoping to reuse the existing decoding scheme. The big missing piece for this is movemask, which is used to obtain the index into the LUT, so I was planning to emulate that part with a few scalar instructions based on the original 2/4-bit values, which probably isn't too bad.

@abrown
Contributor Author

abrown commented Oct 9, 2019

@zeux, thanks for the answers re: meshoptimizer; a couple of additional questions:

The most important thing for the decoder is to have the shuffle in the first place; the out-of-range behavior that got specified represents, as far as I am aware, a reasonable compromise between different architectures

Having said that, the vertex codec does save a cycle by assuming out-of-range shuffles return zero.

So it sounds to me like meshoptimizer would be OK with either "most significant index bit zeroes the lane" or "any out of range index zeroes the lane"--correct? Just as long as you have some lane-zeroing behavior?

The big missing piece for this is movemask which is used to obtain the index into the LUT

How useful would it be to have a WASM equivalent of PMOVMSKB?

@abrown
Contributor Author

abrown commented Oct 9, 2019

Others on the thread (@AndrewScheidecker, @arunetm, @penzn, @tlively): do we know of any other workloads that would take advantage of the "any out of range index zeroes the lane" behavior?

@zeux
Contributor

zeux commented Oct 9, 2019

So it sounds to me like meshoptimizer would be OK with either "most significant index bit zeroes the lane" or "any out of range index zeroes the lane"--correct? Just as long as you have some lane-zeroing behavior?

Yes - correct. Any out-of-range zeroing behavior would be fine. When we initially discussed the currently-specified behavior (any out-of-range index produces 0), I thought about specifying the Intel behavior instead, but that would lead to extra instructions on ARM (to mask off bits 6-4), which didn't seem like a great tradeoff. However, if we decide that this is a better balance, I'm all for it.

You can refer to the cross-architecture behavior analysis here: #68 (comment)

How useful would it be to have a WASM equivalent of PMOVMSKB?

It would be very useful. For what it's worth, there are other SIMD applications that need a function like this - for example, the fast string scanning described in https://zeux.io/2019/04/20/qgrep-internals/ relies on it, and some other fast SIMD methods (e.g. simdjson, Hyperscan) need it as well.
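
The typical pattern looks something like this (a sketch in SSE intrinsics, assuming GCC/Clang for __builtin_ctz):

#include <emmintrin.h>   /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Compress a byte-wise comparison into a 16-bit scalar mask (PMOVMSKB),
   then find the lane of the first matching byte with scalar bit tricks. */
int first_match(__m128i chunk, char needle) {
    __m128i eq = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(needle));
    int mask = _mm_movemask_epi8(eq);        /* bit i = MSB of byte lane i */
    return mask ? __builtin_ctz(mask) : -1;  /* index of first match, or -1 */
}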

The only caveat is that this function needs emulation on other architectures - I'm not sure what the stance is on instructions like this. meshoptimizer has to implement emulation for NEON using horizontal instructions, but WASM currently doesn't have any horizontal instructions, which makes it painful.

@zeux
Contributor

zeux commented Oct 9, 2019

Also worth noting: in the linked issue (#68), @Maratyszcza brought up the fact that with out-of-range-as-zero behavior you can emulate lookup tables larger than 16 elements using basic arithmetic, e.g.

lookup32(index, table0, table1) = shuffle(index, table0) | shuffle(index - 16, table1)

If the zeroing is controlled by the high bit, the code above will not work and it will need something more complex, perhaps bitselect(shuffle(index, table0), shuffle(index - 16, table1), index >= 16) or thereabouts.
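
In intrinsics form, a sketch of the 32-entry lookup under the spec's semantics (current wasm_simd128.h names assumed):

#include <wasm_simd128.h>

/* Indices 16..31 are out of range for table0 and yield zero; after
   subtracting 16, indices 0..15 wrap to 240..255 and yield zero for
   table1, so the two halves combine with a single OR. */
v128_t lookup32(v128_t idx, v128_t table0, v128_t table1) {
    v128_t lo = wasm_i8x16_swizzle(table0, idx);
    v128_t hi = wasm_i8x16_swizzle(table1,
                    wasm_i8x16_sub(idx, wasm_i8x16_splat(16)));
    return wasm_v128_or(lo, hi);
}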

@abrown
Contributor Author

abrown commented Oct 9, 2019

Yes - initially I thought this was how you were using the out-of-range swizzle behavior, but your explanation (and the code) now make me think otherwise. Is this lookup32 (and larger) something we could find in other workloads?

@zeux
Contributor

zeux commented Oct 9, 2019

meshoptimizer doesn't rely on this per se - I don't know of applications that require >16-wide table lookups off the top of my head; maybe @Maratyszcza can comment.

@zeux
Contributor

zeux commented Nov 2, 2019

On a somewhat related note, I've finished the implementation of the SIMD vertex codec for WASM. It works in Chrome Canary. I haven't looked at the codegen, but the performance is pretty good.

The source code is here: zeux/meshoptimizer#72, and I've also uploaded a test scene here: http://zeux.io/tests/cabin. The decoding timings are printed to the console (the function is invoked twice, for the position & color streams).

The performance gains without the v8x16.swizzle path are much smaller, so this is great. The movemask emulation is not so great, but it's hard for me to test the impact of a "proper" movemask instruction.

@abrown abrown changed the title from "v8x16.swizzle may not match PSHUFB on x86" to "Inefficient x64 codegen for swizzle" on Feb 6, 2020
@penzn
Contributor

penzn commented Oct 5, 2020

Let's move discussion on swizzle from #343 here.

@tlively @omnisip the reason the V8 codegen for this is suboptimal is that there is no SSE operation with the semantics defined by the spec - hence the extra instructions to handle potentially-zeroed output lanes.

@omnisip

omnisip commented Oct 5, 2020

It's compounded by the fact that V8 doesn't know that the vector for swizzling is constant.

@omnisip

omnisip commented Oct 5, 2020

@penzn @tlively why isn't there a swizzle_c that has parameters like shuffle?

@tlively
Member

tlively commented Oct 6, 2020

I don't think we have precedent for making both an immediate-argument and a stack-argument version of an instruction. It's not really written down anywhere, but I think we generally expect engines to recognize constant arguments and optimize if possible (but not necessarily do any more complex instruction selection than that).

@omnisip

omnisip commented Oct 6, 2020 via email

@tlively
Member

tlively commented Oct 6, 2020

My guess is that such an instruction would not be portable, but cc @ngzhian who knows more than me about operations available in various hardware.

@omnisip

omnisip commented Oct 6, 2020

My guess is that such an instruction would not be portable, but cc @ngzhian who knows more than me about operations available in various hardware.

I'm referring to the part where swizzle only takes two v128s while shuffle takes two v128s and one immediate array. It could be three v128s.

@zeux
Contributor

zeux commented Oct 6, 2020

swizzle was standardized after shuffle.

When standardizing shuffle, there was an option of limiting the instruction to only accept in-range lane indices. As the instruction already has a complex mapping, I'd imagine this was the more straightforward way to specify it. It looks like there was an old issue (#11) that considered whether it's valuable to allow out-of-range indices, but I don't think it got traction.

When standardizing swizzle, this option didn't exist - since the input is determined at runtime, we must assign semantics to out-of-range values; these semantics were picked to be "output zero for out-of-bounds inputs", behavior that is consistent with implementations on some ISAs (ARM) and reasonably cheap to emulate on others (x64).

Multiple different variants were considered, but the one that made it into the spec was the minimal version that's sufficient to implement a wide range of algorithms. A two-source version was possible but would have required more complex codegen on architectures like x64, and so I believe we decided that we don't need it yet (it's equivalent to the 32-byte table lookup that was discussed in #24).

If the constant-mask shuffle were to be extended with zeroing capability, that would probably belong outside the scope of this specific issue, and would require use cases and documented architecture mappings - for example, if lowering it almost always requires extra instructions compared to the lowering of shuffles today, then it could just as well be implemented in user space.

@tlively
Member

tlively commented Oct 6, 2020

Thanks for that detailed recap, @zeux!

@omnisip

omnisip commented Oct 6, 2020

When standardizing swizzle, this option didn't exist - since the input is determined at runtime, we must assign semantics to out-of-range values; these semantics were picked to be "output zero for out-of-bounds inputs", behavior that is consistent with implementations on some ISAs (ARM) and reasonably cheap to emulate on others (x64).

I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.

@omnisip

omnisip commented Oct 6, 2020

In general, runtime mask generation is the exception, not the norm. If you can get a fixed number of permutations for your data set, you'll never mess with runtime masks.

@zeux
Contributor

zeux commented Oct 6, 2020

I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.

Compile time masks are already handled by the existing shuffle support. If you feel like the shuffle options should include handling out of range values then it should be proposed as a separate issue with specific use cases and proposed changes to lowering for common architectures.

@omnisip

omnisip commented Oct 6, 2020

I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.

Compile time masks are already handled by the existing shuffle support. If you feel like the shuffle options should include handling out of range values then it should be proposed as a separate issue with specific use cases and proposed changes to lowering for common architectures.

Except those compile-time masks don't allow zeroing of a lane's value, unless I'm missing something. I'll file a ticket if need be, but it seems that this logic should be part of swizzle too, since shuffle is cross-vector and swizzle is in-place.

@omnisip

omnisip commented Oct 6, 2020

@zeux I'll file a new proposal for zeroing shuffle.

@ngzhian
Member

ngzhian commented Mar 17, 2021

I don't think there is anything more to this bug.
The relevant V8 tracking bug for swizzle with constant masks is https://bugs.chromium.org/p/v8/issues/detail?id=10992; this will enable single-instruction swizzle (PSHUFB).
