
Inefficient x64 codegen for swizzle #93

Closed
abrown opened this issue Aug 12, 2019 · 37 comments

@abrown
Contributor

abrown commented Aug 12, 2019

Looking at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices I discovered that it would take me more than one instruction to implement v128.swizzle on x86. I had assumed, like @stoklund in #11, that I would be able to use PSHUFB as-is. However, I am now convinced that the assumptions of #11 may be incorrect:

Lanes with an out-of-range selector become 0 in the output vector.

According to the Intel manual (and some experiments I ran), PSHUFB uses the four least significant bits of each selector byte to decide which lane to grab from a vector. If the most significant bit is one (e.g. 0b10000000), the result is zeroed. But index values between 0x0f and 0x80 use the four least significant bits as an index and do not zero the value. To correctly implement the spec as it currently reads, we would need to copy the swizzle mask to another register, do a greater-than comparison to get a bit into the most significant position, and OR this with the original swizzle mask before using the PSHUFB instruction--four instructions instead of one.
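
For illustration, a minimal intrinsics sketch of that four-instruction sequence (function name is mine, assuming SSE2 plus SSSE3):

#include <emmintrin.h>   /* SSE2: compare, OR */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Sketch of the compare-and-OR lowering: force the MSB on any selector
   byte greater than 15 so that PSHUFB zeroes those lanes. Selectors of
   0x80 and above are negative as signed bytes, so the compare misses
   them, but their MSB is already set and PSHUFB zeroes them anyway. */
__m128i swizzle_spec(__m128i v, __m128i idx) {
    __m128i oor = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15)); /* 0xFF where idx > 15 */
    __m128i mask = _mm_or_si128(idx, oor);   /* set MSB on out-of-range lanes */
    return _mm_shuffle_epi8(v, mask);        /* PSHUFB */
}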

Should v128.swizzle change to allow more optimal implementations? Are there considerations for other architectures that I am not aware of?

@AndrewScheidecker
Contributor

See this comment for a relatively efficient way to compile it for SSE. You just need an unsigned saturated add of 112 to set the MSB on out-of-range lanes.

@abrown
Contributor Author

abrown commented Aug 12, 2019

Cool, I guess that answers my question; thanks!

@abrown abrown closed this as completed Aug 12, 2019
@jlb6740

jlb6740 commented Aug 13, 2019

See this comment for a relatively efficient way to compile it for SSE. You just need an unsigned saturated add of 112 to set the MSB on out-of-range lanes.

@abrown @AndrewScheidecker Hi guys ... I am following some SIMD implementation work and now looking at the SIMD spec more closely. Looking at this thread and then at the conversation here: #91, on the surface it seems the definition of instructions like this may be diluting the capabilities of x86? @abrown, how many more instructions will this take ... the thread pointed to says two?

@abrown
Contributor Author

abrown commented Aug 13, 2019

I used three:

  • MOVUPS to get the magic constant in place (112 in decimal, 0x70707070... in memory)
  • PADDUSB to do the saturating add
  • PSHUFB to swizzle the bytes
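
In intrinsics form, a sketch of that three-instruction sequence (assuming SSSE3; the compiler materializes the 0x70... constant behind the MOVUPS load):

#include <emmintrin.h>   /* SSE2: _mm_adds_epu8 (PADDUSB) */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Any index >= 16 saturates to >= 0x80 after adding 112, so PSHUFB
   zeroes exactly the out-of-range lanes, as the spec requires. */
__m128i swizzle_paddusb(__m128i v, __m128i idx) {
    const __m128i k112 = _mm_set1_epi8(0x70);  /* the magic constant (MOVUPS) */
    __m128i m = _mm_adds_epu8(idx, k112);      /* PADDUSB: unsigned saturating add */
    return _mm_shuffle_epi8(v, m);             /* PSHUFB */
}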

@abrown abrown reopened this Aug 13, 2019
@penzn
Contributor

penzn commented Aug 13, 2019

movups assumes that the constant lives in a static data section. That may not work for a Wasm runtime.

@abrown
Contributor Author

abrown commented Aug 13, 2019

@penzn, after we discussed a version that would use shifts to build the 0x7070... value, I looked through the Intel manual and was unable to find a byte-width shift that would work for what we described (but maybe I just missed it). Here's a five-instruction version that still uses moves, but from immediates and registers:

  • MOV 0x70707070_70707070, %rax sets up half of the magic value
  • MOVQ %rax, %xmm0 moves it into the lower eight bytes of the XMM register
  • PSHUFD to copy the lower eight bytes to the upper eight bytes
  • PADDUSB to do the saturating add
  • PSHUFB
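
The same register-only materialization in intrinsics form (a sketch, assuming x86-64 and SSSE3; no data-section load needed):

#include <emmintrin.h>   /* SSE2 */
#include <tmmintrin.h>   /* SSSE3 */

/* Build 0x70 in every lane from an immediate: MOV + MOVQ fill the low
   eight bytes, PSHUFD duplicates them into the high eight bytes. */
__m128i swizzle_no_load(__m128i v, __m128i idx) {
    __m128i k = _mm_cvtsi64_si128(0x7070707070707070LL);  /* MOV + MOVQ */
    k = _mm_shuffle_epi32(k, _MM_SHUFFLE(1, 0, 1, 0));    /* PSHUFD: copy low qword up */
    __m128i m = _mm_adds_epu8(idx, k);                    /* PADDUSB */
    return _mm_shuffle_epi8(v, m);                        /* PSHUFB */
}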

In any case, the "zero out-of-range indices" requirement seems to cause a rather disappointing lowering for x86.

@penzn
Contributor

penzn commented Aug 14, 2019

Yeah, you are right, what I had in mind would not work, since there is no packed byte shift instruction...

There are "VPBROADCAST[BWDQ]" instructions in AVX2, which can broadcast a value from general purpose register to all lanes, you can avoid doing en extra shuffle if that is available.

Also, see #70 and #63

@arunetm
Collaborator

arunetm commented Aug 14, 2019

@abrown @jlb6740 swizzle will need multiple instructions using SSE4 (it can be implemented as a special case of shuffle).

This is a very useful operation for application developers in expressing matrix operations like multiply, inverse, etc. The extra cost of the additional instructions for swizzle on IA is expected to be amortized away in applications and kernels. It would be very useful if we could verify this assumption using your implementation.

The SIMD instruction set is still being iterated on, and the above data point will help us make the right choices to ensure expressiveness and portable performance for Wasm SIMD applications.

@zeux
Contributor

zeux commented Aug 20, 2019

Note that, as described in #68 (comment), the out-of-range behavior is substantially different between architectures; at the time of the variable-index shuffle proposal, the current spec was believed to be about as optimal as the next-best alternative (specifying that only the lower 4 bits affect the result), because that alternative would require masking the extra bits off on x86.

The limitations on loading constants seem unfortunate. With RIP-relative addressing, the codegen for a shuffle with the semantics declared by the spec is, trivially (https://gcc.godbolt.org/z/gY5qE8):

shuffle(long long __vector(2), long long __vector(2)):                      # @shuffle(long long __vector(2), long long __vector(2))
        vpaddusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_0]
        vpshufb xmm0, xmm0, xmm1
        ret

It seems like any specification of shuffle that doesn't try to be 100% compatible with x64 behavior (which would penalize other platforms) will require some form of vector constant load or materialization, which will invariably be expensive.

@penzn
Contributor

penzn commented Aug 22, 2019

@zeux that matches @abrown's initial solution above with a load. It would be good to test swizzle's performance. Do you know of any code that would exercise this instruction?

@zeux
Contributor

zeux commented Sep 7, 2019

@penzn Not sure - I actually wanted to try porting my vertex codec, which relies on pshufb, to wasm, but the issue is that variable shuffles are currently not exposed in Emscripten's intrinsic header at all, so it's hard to experiment with this.

@tlively
Member

tlively commented Sep 7, 2019

I expect to get the toolchain up to date with the latest spec proposal this week, so you should be able to experiment soon.

@abrown
Contributor Author

abrown commented Sep 26, 2019

@zeux I've looked at meshoptimizer a bit and I can't see how swizzle's "zero out-of-range index" behavior would change your code. It seems to me that for the shuffle (https://github.com/abrown/meshoptimizer/blob/284728f0f717ae9eda7f3e1d41dfaaa3423b189b/src/vertexcodec.cpp#L483) you will still need a kDecodeBytesGroupShuffle lookup table for generating the shuffle mask (https://github.com/abrown/meshoptimizer/blob/284728f0f717ae9eda7f3e1d41dfaaa3423b189b/src/vertexcodec.cpp#L418), regardless of whether out-of-range lanes are marked by the top bit being 1 or simply by an index greater than 15. Am I reading that code correctly? Or would you perhaps write a different implementation entirely, without the lookup table, when compiling to Wasm SIMD? (In essence, I'm trying to understand how and why the shuffle/swizzle is used in your decoding, so any help is appreciated 😄)

@zeux
Contributor

zeux commented Sep 27, 2019

The most important thing for the decoder is to have the shuffle in the first place; the out-of-range behavior that got specified represents, as far as I am aware, a reasonable compromise between different architectures (C would have made the behavior unspecified, but I don't think wasm can...).

Having said that, the vertex codec does save a cycle by assuming out-of-range shuffles return zero. The way this works is that the bitmask for the shuffle picks a table entry that effectively requires reading a few unpacked bytes from the input stream and mixing them with packed 1/2/4-bit integers. To do this, a shuffle is used to get a vector where positions occupied by packed bytes become 0 (using out-of-range selectors) and positions occupied by unpacked bytes contain the final value. This can then be combined with the vector that contains just the packed bytes with the other slots masked off - instead of something like blendv or an andnot/and/or combo, this can just use or.
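
A simplified sketch of that combining step (illustrative only, not meshoptimizer's actual code; current wasm_simd128.h names assumed):

#include <wasm_simd128.h>

/* Lanes whose selector is out of range come back as zero from the
   swizzle, so the shuffled unpacked bytes can be merged with the
   pre-masked packed bytes using a single OR. */
v128_t merge_group(v128_t unpacked_src, v128_t sel, v128_t packed_masked) {
    v128_t unpacked = wasm_i8x16_swizzle(unpacked_src, sel); /* packed slots -> 0 */
    return wasm_v128_or(unpacked, packed_masked);  /* no blendv or andnot/and/or */
}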

However, if the out-of-range behavior were different, I would just spend an extra instruction to mask the bytes.

@zeux
Contributor

zeux commented Sep 27, 2019

In terms of getting the codec to run on wasm, I was hoping to reuse the existing decoding scheme. The big missing piece for this is movemask, which is used to obtain the index into the LUT, so I was planning to emulate that part with a few scalar instructions based on the original 2/4-bit values, which probably isn't too bad.

@abrown
Contributor Author

abrown commented Oct 9, 2019

@zeux, thanks for the answers re: meshoptimizer; a couple of additional questions:

The most important thing for the decoder is to have the shuffle in the first place; the out-of-range behavior that got specified represents, as far as I am aware, a reasonable compromise between different architectures

Having said that, the vertex codec does save a cycle by assuming out-of-range shuffles return zero.

So it sounds to me like meshoptimizer would be OK with either "most significant index bit zeroes the lane" or "any out of range index zeroes the lane"--correct? Just as long as you have some lane-zeroing behavior?

The big missing piece for this is movemask which is used to obtain the index into the LUT

How useful would it be to have a WASM equivalent of PMOVMSKB?

@abrown
Contributor Author

abrown commented Oct 9, 2019

Others on the thread (@AndrewScheidecker, @arunetm, @penzn, @tlively): do we know of any other workloads that would take advantage of the "any out of range index zeroes the lane" behavior?

@zeux
Contributor

zeux commented Oct 9, 2019

So it sounds to me like meshoptimizer would be OK with either "most significant index bit zeroes the lane" or "any out of range index zeroes the lane"--correct? Just as long as you have some lane-zeroing behavior?

Yes - correct. Any out-of-range zeroing behavior would be fine. When we initially discussed the currently-specified behavior (any out-of-range index produces 0), I thought about specifying the Intel behavior instead, but that would lead to extra instructions on ARM (to mask off bits 6-4), which didn't seem like a great tradeoff. However, if we decide that this is a better balance, I'm all for it.

You can refer to the cross-architecture behavior analysis here: #68 (comment)

How useful would it be to have a WASM equivalent of PMOVMSKB?

It would be very useful. For what it's worth, there are other SIMD applications that need a function like this - for example, the fast string scanning described in https://zeux.io/2019/04/20/qgrep-internals/ relies on it, and some other fast SIMD methods (e.g. simdjson, Hyperscan) need it as well.
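
The typical pattern looks something like this (a sketch in SSE intrinsics, assuming GCC/Clang for __builtin_ctz):

#include <emmintrin.h>   /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Compress a byte-wise comparison into a 16-bit scalar mask (PMOVMSKB),
   then find the lane of the first matching byte with scalar bit tricks. */
int first_match(__m128i chunk, char needle) {
    __m128i eq = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(needle));
    int mask = _mm_movemask_epi8(eq);        /* bit i = MSB of byte lane i */
    return mask ? __builtin_ctz(mask) : -1;  /* index of first match, or -1 */
}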

The only caveat is that this function needs emulation on other architectures - I'm not sure what the stance is on instructions like this. meshoptimizer has to implement emulation for NEON using horizontal instructions, but WASM currently doesn't have any horizontal instructions, which makes it painful.

@zeux
Contributor

zeux commented Oct 9, 2019

Also worth noting: in the linked issue (#68), @Maratyszcza brought up the fact that with out-of-range-as-zero behavior you can emulate lookup tables larger than 16 elements using basic arithmetic, e.g.

lookup32(index, table0, table1) = shuffle(index, table0) | shuffle(index - 16, table1)

If the zeroing is controlled by the high bit, the code above will not work and it will need something more complex, perhaps bitselect(shuffle(index, table0), shuffle(index - 16, table1), index >= 16) or thereabouts.
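
In intrinsics form, a sketch of the 32-entry lookup under the spec's semantics (current wasm_simd128.h names assumed):

#include <wasm_simd128.h>

/* Indices 16..31 are out of range for table0 and yield zero; after
   subtracting 16, indices 0..15 wrap to 240..255 and yield zero for
   table1, so the two halves combine with a single OR. */
v128_t lookup32(v128_t idx, v128_t table0, v128_t table1) {
    v128_t lo = wasm_i8x16_swizzle(table0, idx);
    v128_t hi = wasm_i8x16_swizzle(table1,
                    wasm_i8x16_sub(idx, wasm_i8x16_splat(16)));
    return wasm_v128_or(lo, hi);
}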

@abrown
Contributor Author

abrown commented Oct 9, 2019

Yes - initially I thought this was how you were using the out-of-range swizzle behavior, but your explanation (and the code) now make me think otherwise. Is this lookup32 (and larger) something we could find in other workloads?

@zeux
Contributor

zeux commented Oct 9, 2019

meshoptimizer doesn't rely on this per se - I don't know of applications that require >16-wide table lookups off the top of my head; maybe @Maratyszcza can comment.

@zeux
Contributor

zeux commented Nov 2, 2019

On a somewhat related note, I've finished the implementation of the SIMD vertex codec for WASM. It works in Chrome Canary. I haven't looked at the codegen, but the performance is pretty good.

The source code is here: zeux/meshoptimizer#72, and I've also uploaded a test scene here: http://zeux.io/tests/cabin. The decoding timings are printed to the console (the function is invoked twice, for the position & color streams).

The performance gains without the v8x16.swizzle path are much smaller, so this is great. The movemask emulation is not so great, but it's hard for me to test the impact of a "proper" movemask instruction.

@abrown abrown changed the title from "v8x16.swizzle may not match PSHUFB on x86" to "Inefficient x64 codegen for swizzle" on Feb 6, 2020
@penzn
Contributor

penzn commented Oct 5, 2020

Let's move discussion on swizzle from #343 here.

@tlively @omnisip the reason the V8 codegen for this is suboptimal is that there is no SSE operation with the semantics defined by the spec - hence the extra instructions to handle potentially-zeroed output lanes.

@omnisip

omnisip commented Oct 5, 2020

It's compounded by the fact that V8 doesn't know that the vector for swizzling is constant.

@omnisip

omnisip commented Oct 5, 2020

@penzn @tlively why isn't there a swizzle_c that has parameters like shuffle?

@tlively
Member

tlively commented Oct 6, 2020

I don't think we have precedent for making both an immediate-argument and a stack-argument version of an instruction. It's not really written down anywhere, but I think we generally expect engines to recognize constant arguments and optimize if possible (but not necessarily do any more complex instruction selection than that).

@omnisip

omnisip commented Oct 6, 2020 via email

@tlively
Member

tlively commented Oct 6, 2020

My guess is that such an instruction would not be portable, but cc @ngzhian who knows more than me about operations available in various hardware.

@omnisip

omnisip commented Oct 6, 2020

My guess is that such an instruction would not be portable, but cc @ngzhian who knows more than me about operations available in various hardware.

I'm referring to the part where swizzle only takes two v128s while shuffle takes two v128s and one immediate array. It could be three v128s.

@zeux
Contributor

zeux commented Oct 6, 2020

swizzle was standardized after shuffle.

When standardizing shuffle, there was an option of limiting the instruction to only accept in-range lane indices. As the instruction already has a complex mapping, I'd imagine this was the more straightforward way to specify it. It looks like there was an old issue (#11) that considered whether it's valuable to allow out-of-range indices, but I don't think it got traction.

When standardizing swizzle, this option didn't exist - since the input is determined at runtime, we must assign semantics to out-of-range values; these semantics were picked to be "output zero for out-of-bounds inputs", behavior that is consistent with implementations on some ISAs (ARM) and reasonably cheap to emulate on others (x64).

Multiple different variants were considered, but the one that made it into the spec was the minimal version that's sufficient to implement a wide range of algorithms. A two-source version was possible but would have required more complex codegen on architectures like x64, and so I believe we decided that we don't need it yet (it's equivalent to the 32-byte table lookup that was discussed in #24).

If the constant-mask shuffle were to be extended with zeroing capability, that would probably belong outside the scope of this specific issue, and would require use cases and documented architecture mappings - for example, if lowering it almost always requires extra instructions compared to the lowering of shuffles today, then it could just as well be implemented in user space.

@tlively
Member

tlively commented Oct 6, 2020

Thanks for that detailed recap, @zeux!

@omnisip

omnisip commented Oct 6, 2020

When standardizing swizzle, this option didn't exist - since the input is determined at runtime, we must assign semantics to out-of-range values; these semantics were picked to be "output zero for out-of-bounds inputs", behavior that is consistent with implementations on some ISAs (ARM) and reasonably cheap to emulate on others (x64).

I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.

@omnisip

omnisip commented Oct 6, 2020

In general, runtime mask generation is the exception, not the norm. If you can get a fixed number of permutations for your data set, you'll never mess with runtime masks.

@zeux
Contributor

zeux commented Oct 6, 2020

I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.

Compile time masks are already handled by the existing shuffle support. If you feel like the shuffle options should include handling out of range values then it should be proposed as a separate issue with specific use cases and proposed changes to lowering for common architectures.

@omnisip

omnisip commented Oct 6, 2020

I don't follow this logic. There are plenty of cases where swizzles have compile-time constants. Think about any time you're converting integer types or rotating floating-point values.

Compile time masks are already handled by the existing shuffle support. If you feel like the shuffle options should include handling out of range values then it should be proposed as a separate issue with specific use cases and proposed changes to lowering for common architectures.

Except those compile-time masks don't allow zeroing of a lane's value, unless I'm missing something. I'll file a ticket if need be, but it seems that this logic should be part of swizzle too, since shuffle is cross-vector and swizzle is in-place.

@omnisip

omnisip commented Oct 6, 2020

@zeux I'll file a new proposal for zeroing shuffle.

@ngzhian
Member

ngzhian commented Mar 17, 2021

I don't think there is anything more to this bug.
The relevant V8 tracking bug for swizzle with constant masks is https://bugs.chromium.org/p/v8/issues/detail?id=10992; this will enable single-instruction swizzle (PSHUFB).
