This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Alternative to Swizzle / Shuffle #8

Closed
billbudge opened this issue Apr 21, 2017 · 19 comments

Comments

@billbudge

These operations will be very difficult to implement efficiently.

Even on Intel (SSE, AVX) which has native shuffle instructions, the code generation for this is very complex. Here is the Intel instruction selection lowering in LLVM. There is quite a lot of code to implement the shuffle intrinsic:

X86ISelLowering.cpp

The ARM version is worse, since many cases require use of 2 'vtbl' instructions, in addition to materializing the two constant shuffle masks in d-registers. LLVM uses a pre-generated table of 26K entries to generate fast instruction sequences for the 32x4 shuffles.

ARMISelLowering.cpp

This is a lot of complexity for any compiler, and too much for WebAssembly translators. Most of these swizzles and shuffles will never be used. It is a hazard to provide this feature if we can't guarantee that all shuffles will be fast on all platforms.

An alternative is to implement a small set of primitive permutations that we know can be implemented efficiently, without lots of work in the translator. I think these should cover most real-world cases, and can be composed by the programmer or compiler for other shuffles. By being similar to a real ISA, it should also be straightforward to modify toolchains to support WASM SIMD.

I'm recommending a set along the lines of the ARM permutation instructions:

  • Interleave(low, high) (merge elements from two source vectors into a single vector, with low and high modifiers so we only have a single result vector.)
  • De-interleave(low, high) (inverse shuffle from interleave)
  • Transpose(low, high) (swap even elements from first source with odd elements from second source, low and high modifiers.)
  • Concatenate(k) (concatenate two source vectors, top k bytes from first, bottom 16-k bytes from second, 0 < k < 16; AKA "slide" or "window" shuffle.)
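As a scalar reference for the proposed semantics, here is a minimal sketch on 4-lane vectors. The function names are illustrative, not proposed opcodes, and Concatenate's byte-granular k is simplified to whole lanes:

```cpp
#include <array>
#include <cstdint>

using V4 = std::array<uint32_t, 4>;

// Interleave (ARM vzip-style): merge alternating lanes from two sources,
// with low/high variants so each result fits in a single vector.
inline V4 interleave_low(V4 a, V4 b)  { return {a[0], b[0], a[1], b[1]}; }
inline V4 interleave_high(V4 a, V4 b) { return {a[2], b[2], a[3], b[3]}; }

// De-interleave (ARM vuzp-style): gather the even (low) or odd (high)
// lanes of the concatenation a:b.
inline V4 deinterleave_low(V4 a, V4 b)  { return {a[0], a[2], b[0], b[2]}; }
inline V4 deinterleave_high(V4 a, V4 b) { return {a[1], a[3], b[1], b[3]}; }

// Transpose (ARM vtrn-style): pair even lanes of a with even lanes of b
// (low), or odd lanes of a with odd lanes of b (high).
inline V4 transpose_low(V4 a, V4 b)  { return {a[0], b[0], a[2], b[2]}; }
inline V4 transpose_high(V4 a, V4 b) { return {a[1], b[1], a[3], b[3]}; }

// Concatenate-and-slide (ARM vext-style): take lanes k..3 of a followed
// by lanes 0..k-1 of b.
inline V4 concat(V4 a, V4 b, int k) {
  V4 r{};
  for (int i = 0; i < 4; ++i) {
    int j = i + k;
    r[i] = (j < 4) ? a[j] : b[j - 4];
  }
  return r;
}
```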

Additionally, we may want to have shuffles that reverse the lanes in various patterns like the ARM vrev instructions.

On Intel these can be implemented using the pshuf instructions. We're assuming SSE 4.1 as a baseline for SIMD support right now. POWER and MIPS have similar primitive shuffles.

@stoklund
Contributor

The discussion of these operations predates my involvement with the SIMD.js project. Perhaps @sunfishcode or @PeterJensen can remember the rationale.

We have implemented these operations in Firefox with an SSE 2 baseline. We use pshufb as the catch-all fallback when SSSE3 is available, and a very sad bounce through memory otherwise. There's no ARM implementation of SIMD.js in Firefox.

We have some experience with this representation of shuffles from LLVM IR which does more or less the same thing as this proposal. C code using Intel intrinsics is translated into abstract shufflevector instructions by Clang. Early versions of LLVM would even perform clever optimizations on chained shuffles. This had the effect of making intrinsics programmers very angry because the "optimized" shuffles would often be translated into different instruction sequences with worse performance. Today, LLVM mostly leaves shuffles alone and doesn't attempt to "optimize" them. The code generator is able to match patterns corresponding to native instructions and does something complicated for the rest. This approach works reasonably well for code written with intrinsics and for more general code like OpenCL which provides arbitrary shuffles.

I think the rationale behind the abstract shuffle representation is that you can start by implementing the catch-all fallback with pshufb. It's not going to perform great, but it gets you out the door. You can then incrementally add coverage for shuffle patterns that turn out to be important. The advantage is that wasm producers are not limited to a prescribed set of shuffle patterns.

I think it would be a good idea to develop a list of patterns that should be fast in a good implementation, and we shouldn't expect implementations to provide optimal instruction sequences for all patterns. That would indeed be a big burden.

@PeterJensen
Collaborator

Here's a bit of history:

We used to only have one shuffle operation that would work on 32x4 vectors. We provided the shuffle indices as a mask with constants for each of the 256 possible combinations (SIMD.XXXX, SIMD.XXXY, SIMD.XXXZ, etc). This approach wasn't scalable when the 16x8 and 8x16 types were introduced. Also, in order to take advantage of two-input shuffle instructions (e.g. shufps), we renamed the original shuffle to swizzle and changed shuffle to take two inputs.

There's a couple of discussion threads about this here:

tc39/ecmascript_simd#74
tc39/ecmascript_simd#182

Because of the explosion of possible shuffle masks with the 16x8 and 8x16 types, and the desire to still have the code be readable, we went with the 'list-of-indices' parameter approach instead.

IMO, this is a pretty nice low-level 'abstraction'. It covers a lot (all?) of different 'shuffle-like' instructions in both SSE and NEON.

I think it's preferable to keep this abstraction in favor of exposing 'frequently used' shuffle pattern operations like the ones mentioned above.

I don't think it's 'too much' for a wasm translator to handle. Picking the right instruction(s) based on the combination of shuffle indices shouldn't be a time-consuming operation. It might be a fairly large chunk of code, but not time consuming to execute. Jacob also points out that the pshufb instruction is available for a quick initial implementation (I think NEON has a similar one (VTBL?)). When more shuffle patterns emerge as being important for performance, or newer architectures favor other instruction combinations for fixed shuffle patterns, it becomes a matter of adjusting the translator, not a matter of introducing new wasm SIMD opcodes. Isn't one of the goals of WASM to have it be as 'target-independent' as possible?

Not sure exactly why there's such a huge amount of code in LLVM that deals with picking the right instruction sequences. I did notice that the clang version of xmmintrin.h does use the __builtin_shufflevector intrinsic to implement other intrinsics, so the compiler will have to de-abstract back from shufflevector() to the instruction that matched the original user specified intrinsic. Maybe that's what Jacob mentioned that got the SIMD intrinsics developers miffed :)

Anyways, just my two cents.

@billbudge
Author

billbudge commented Apr 25, 2017

Hi Peter, thanks for the comments.

On ARM, vtbl is indeed available for arbitrary shuffles. However, I think it's relatively expensive - it only operates on half of a SIMD register (at least for 32-bit ARMv7 CPUs). Also, the shuffle mask is not an immediate operand, and must be materialized as a constant (either through 4 mov immediates into general-purpose registers and a vmov into the d-register, or 64 bits loaded from constant-pool memory into a d-register). LLVM seems to try very hard to avoid vtbl.

So I see two main disadvantages to the general API:

  1. Some shuffles on some platforms will be significantly slower than expected. This leads to performance cliffs, which make it harder to get consistent performance improvement with a portable SIMD. Without consistently faster performance, it's hard to justify SIMD as a feature.

  2. The wasm compiler must be more complex to generate good shuffles. At least in V8, when a wasm opcode expands to multiple instructions, it's too late to do much scheduling - interleaving the instructions with other wasm instructions to minimize stalls which are inherent in these sequences.

I realize that this penalizes ISAs that have support for fully general shuffles. I'd reconsider if others can confirm ways around the performance cliff, or that vtbl isn't that bad on ARM.

There are many attractive platform-specific capabilities that I would also argue shouldn't be part of a portable SIMD feature.

  1. ARM's powerful vld1 and vst1 opcodes, which can stride through data to reduce the need for shuffles. These would be difficult to emulate.

  2. Fused multiply-add.

  3. ARM reciprocal and reciprocal square root estimate refinement, vrecps and vrsqrts. These require multiple instructions on non-ARM, and can be done explicitly (and scheduled properly) by the programmer / compiler.

@sunfishcode
Member

The other consideration here is the producer side. In SIMD.js, many of our users were porting SIMD code from other platforms and/or using APIs that already offer fully general shuffles (clang C/C++ SIMD extensions, GCC C/C++ SIMD extensions, OpenCL, SPIR-V, are all developer-facing portable APIs that offer fully general shuffles). If we only give them a restricted set of shuffles, they'll end up synthesizing the shuffles they need out of multiple shuffle operations. Then, to avoid generating even worse code, engines would be obliged to recombine the shuffles, and end up facing the same fully-general shuffle problem anyway.

I'm sympathetic to the performance cliff argument. Fully general shuffles create some cliffs, and I'd be very interested in ideas for solving this. Unfortunately, restricting the set of shuffles wasm offers doesn't appear to solve it, and it makes other things more complex.

I agree about strided accesses and fma at this time; fortunately the proposal here already omits those. Reciprocal and reciprocal sqrt are more nuanced; #3 is already opened to discuss them.

@billbudge
Author

I can see a few ways to satisfy these different approaches.

  1. Have fully general swizzle and shuffle, and the spec guarantees that certain shuffles will be fast. I would suggest the ones I list above, since I believe they are fast on all targets, plus any other shuffles that are demonstrated to be useful and fast.

  2. Have swizzle and shuffle opcodes, and separate opcodes for the above primitive shuffles. That simplifies the translator and reduces wasm code size somewhat (in general, swizzle requires 32 bits, shuffle requires 80 bits of immediate operand for the 16 lane types, while the primitive shuffle opcodes imply the mask.)

In either case, general swizzle and shuffle could be added to avoid compiler code generation problems, though they could introduce a performance cliff.

@stoklund
Contributor

I looked up some performance data just to get an idea of what we're talking about. Data sources:

In the tables below I compare the fully general shuffle instructions to a few of the fast special-purpose shuffles provided by the ISA. The numbers are "Latency" x "Throughput".

ARMv7

NEON provides general shuffles with the VTBL and VTBX instructions which come in variants that can pick lanes from 1, 2, 3, or 4 D-registers. Each instruction shuffles into a single 64-bit D register, so both 128-bit shuffles and 128-bit swizzles can be implemented by two VTBL instructions (with 4 and 2 D-register inputs respectively).

                      Cortex-A57   Cortex-A72
vtbl 64-bit swizzle   3 x 2        3 x 2
vtbl 64-bit shuffle   6 x 2        6 x 2
vtrn                  3 x 1        3 x 1
vext*                 3 x 2        3 x 2

*) The guides don't make it clear if the vext throughput is for the D or Q version. Most ASIMD instructions have half the throughput in their Q versions, so I suspect the true numbers are 3 x 1 for a Q-sized vext too.

On these micro-architectures, a general 128-bit swizzle executes with the same latency and throughput as a single fast shuffle instruction once you have the control word loaded into a D-register. A general shuffle has twice the latency of a fast instruction and the same throughput.

ARMv8 AArch64

In 64-bit mode, the TBL instruction can write a whole Q-register, and it comes in variants that read lanes from 1-4 Q registers. This means that both swizzles and shuffles can be implemented by a single TBL instruction.

              Cortex-A57   Cortex-A72
tbl swizzle   6 x ?        6 x ?
tbl shuffle   9 x ?        9 x ?
trn1          3 x 1        3 x 1
ext*          3 x 2        3 x 2

In this mode, a general swizzle has 2x the latency of a fast instruction and a general shuffle is 3x the latency. Throughput numbers are incomplete. My guess is they are the same.

Intel

With SSSE3 available, general shuffles and swizzles are provided by the pshufb instruction. Swizzles take one pshufb; shuffles take two pshufb instructions plus a por to combine the results.

         Skylake   Silvermont   Bulldozer
pshufb   1 x 1     5 x 1/5      3 x 1
por      1 x 2     1 x 2        2 x 2
pshufd   1 x 1     1 x 1        2 x 1
shufps   1 x 1     1 x 1        2 x 1

Skylake seems to do all its shuffles on port 5, so shuffle throughput is never more than 1. It does pshufb in one cycle, just like the less general shuffles. Silvermont breaks pshufb into 4 micro-ops, and AMD's Bulldozer needs an extra cycle for pshufb compared to the fast shuffles.
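The two-pshufb-plus-por decomposition can be modeled in scalar code. This is a sketch of the instruction semantics, not an implementation; the helper names are mine:

```cpp
#include <array>
#include <cstdint>

using Bytes = std::array<uint8_t, 16>;

// Scalar model of SSSE3 pshufb: for each output byte, a set high bit in
// the mask byte yields zero; otherwise the low 4 bits select a source byte.
inline Bytes pshufb(Bytes src, Bytes mask) {
  Bytes r{};
  for (int i = 0; i < 16; ++i)
    r[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
  return r;
}

// A two-input shuffle with lane indices 0..31 decomposes into two pshufb's
// whose masks zero the lanes belonging to the other input, plus a por.
inline Bytes shuffle2(Bytes a, Bytes b, std::array<uint8_t, 16> idx) {
  Bytes ma, mb, r;
  for (int i = 0; i < 16; ++i) {
    ma[i] = (idx[i] < 16) ? idx[i] : 0x80;          // select from a, else zero
    mb[i] = (idx[i] >= 16) ? (idx[i] - 16) : 0x80;  // select from b, else zero
  }
  Bytes ra = pshufb(a, ma), rb = pshufb(b, mb);
  for (int i = 0; i < 16; ++i)
    r[i] = ra[i] | rb[i];  // por combines the two halves
  return r;
}
```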

Conclusion

All of the important SIMD ISAs provide instructions that can be used to synthesize fully general swizzles and shuffles. (MIPS MSA has a set of vshf instructions, but I haven't found performance data). Among the micro-architectures I looked at, Silvermont stands out as having a quite slow implementation of pshufb, otherwise the general-purpose shuffle instructions are comparable to the various fast special-purpose instructions provided by the ISAs, with a tendency towards longer latencies. I'm a bit disappointed in the AArch64 shuffle too.

The common cost that everybody pays is that these instructions take their shuffle mask in a vector register rather than as an immediate operand. This cost is harder to quantify; it depends on the context. If enough vector registers are available, these constants can be materialized outside the inner loop which makes them effectively free. If not, a constant pool load is not on the critical path and usually not a big deal for an out-of-order design. It all depends, of course.

There is no doubt that the general shuffle instructions are less performant than the various special-purpose instructions provided by the SIMD ISAs. However, when used as the catch-all fallbacks they establish a pretty good performance baseline that implementations can improve on. I think the talk of performance cliffs is overstated. It's more like a performance crack in the pavement.

These performance numbers make me a bit skeptical about LLVM's perfect shuffle tables. This doesn't help either:

    unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
    unsigned Cost = (PFEntry >> 30);
    if (Cost <= 4)
      return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
  }
  return GenerateTBL(Op, ShuffleMask, DAG);
}

What kind of science was used to determine that a TBL is preferable when an unsigned shifted by 30 is greater than 4?

LLVM's x86 backend seems to have received more shuffle-related work recently (and more science). The code is complicated, but the general approach is to prefer pshufb unless the shuffle mask can be implemented in one or two special-purpose instructions. Part of the complication is that LLVM tries to honor the original shuffle masks from the source code unless it can replace it with something that is definitely better. This is to avoid breaking code that was written using <xmmintrin.h> by people who know what they are doing.

With the general shuffles in WebAssembly, my advice to implementers would be an incremental approach:

  1. Implement support for all shuffles and swizzles using the catch-all instructions like pshufb.
  2. Match shuffle patterns that correspond to a single native instruction. This is easy.
  3. Match shuffle patterns that correspond to 2 native instructions. This is harder, and will probably be driven by benchmarking and demand rather than an attempt at completeness.

We can also provide a non-normative list of shuffle patterns that should be fast, starting from the operations @billbudge proposes here. Implementers can use that as a guide to step 3.
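Step 2 of that incremental approach might look something like the following sketch, assuming a 16-byte shuffle with immediate lane indices. The pattern names and the set of recognized patterns are illustrative, not a real engine's list:

```cpp
#include <array>
#include <cstdint>

enum class ShufKind { Identity, SplatA, InterleaveLowAB, General };

// Illustrative classifier: recognize index patterns that map to a single
// native instruction; everything else falls back to pshufb / vtbl.
inline ShufKind classify(const std::array<uint8_t, 16>& idx) {
  bool identity = true, splat = true, zip_lo = true;
  for (int i = 0; i < 16; ++i) {
    identity &= (idx[i] == i);
    splat &= (idx[i] == idx[0]);
    // punpcklbw-style interleave: a0,b0,a1,b1,... where b's lanes are 16..23.
    zip_lo &= (idx[i] == uint8_t((i % 2) ? 16 + i / 2 : i / 2));
  }
  if (identity) return ShufKind::Identity;      // no code needed
  if (splat) return ShufKind::SplatA;           // e.g. a single vdup / pshufb splat
  if (zip_lo) return ShufKind::InterleaveLowAB; // e.g. punpcklbw / vzip
  return ShufKind::General;                     // catch-all fallback
}
```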

@PeterJensen
Collaborator

@stoklund Very nice writeup, Jakob! Thank you!

PS: I also thought the perf cliff thing was more like a crack :) Thank you for supplying data to support that.

@PeterJensen
Collaborator

@billbudge I would recommend your 1) suggestion. The code size savings you get by 2) are probably not worth the hassle of dealing with extra opcodes. (Aren't you running out of opcodes? I thought I read something about that). In any case, for modules that make use of shuffles the actual static amount of those is most likely insignificant compared to the total size of the module.

@stoklund
Contributor

@billbudge, I haven't paid much attention to code size in the proposal, and the 16-lane swizzle and shuffle instructions currently encode their 16 lane indexes in 16 immediate bytes. I am expecting SIMD instructions to be concentrated in a few functions that are hot in terms of execution time, but only a small fraction of the total program's code size.

Given the availability and performance of the general shuffle instructions, I am actually tempted to make WebAssembly's shuffle and swizzle instructions more general by permitting out-of-range immediate lane indexes. Such lanes would be zeroes in the output. This behavior is supported by all platforms. I'll file a separate issue for that.
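The proposed zeroing behavior could be specified along these lines. This is a sketch with byte lanes; `shuffle_z` is a hypothetical name, not a proposed opcode:

```cpp
#include <array>
#include <cstdint>

using Bytes = std::array<uint8_t, 16>;

// Sketch of the proposed generalization: byte lane indices 0-15 select
// from a, 16-31 select from b, and any larger index produces a zero byte
// instead of being a validation error.
inline Bytes shuffle_z(Bytes a, Bytes b, std::array<uint8_t, 16> idx) {
  Bytes r{};
  for (int i = 0; i < 16; ++i)
    r[i] = idx[i] < 16 ? a[idx[i]]
         : idx[i] < 32 ? b[idx[i] - 16]
                       : 0;  // out-of-range lane index zeroes the output byte
  return r;
}
```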

@billbudge
Author

Thanks for the detailed survey @stoklund - I'm currently concerned about ARMv7 code generation for this feature.

You left out the cost of materializing the shuffle mask in a SIMD register since it is hard to quantify. In theory, it can be hoisted out of loops. We don't currently do such optimizations in our translator but after asking around, it may be possible to get our register allocator to give a similar effect. Of course, these shuffle masks may spill and have to be reloaded if there is enough register pressure, so I think it's still preferable to use the primitive permute opcodes when we can (mostly on ARM).

Chrome's media team has told us that SIMD loops that spill definitely hurt performance. WebAssembly abstracts away the physical registers so I think this is a significant problem for us.

I agree that shuffle instruction size is not a real problem. I also agree that we should support general swizzles and shuffles so compilers can generate something. I don't think we'll really know how much this matters until we have some real client code.

More generally, in my own web searches, I have found almost no code written to use portable SIMD intrinsics. Has there been a successful portable SIMD design? Everyone we've talked to uses platform-specific SIMD intrinsics.

I don't think it's a given that we'll pass Apple's bar of a large enough performance improvement for a real world component (like a video codec) across multiple target platforms to justify this large addition to WebAssembly. When I see us generating more than one or two instructions to implement a SIMD opcode I get nervous.

I've also been meaning to file an issue about some other opcodes (divide, square root) that I think will be slow on ARM. I'll file that soon.

@PeterJensen
Collaborator

PeterJensen commented Apr 25, 2017

@billbudge as far as a 'successful portable SIMD design'... I think the C++ standards committee (WG21) is considering Matthias Kretz's proposal:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0214r0.pdf

In case you're interested. It probably doesn't qualify as a successful portable design yet, since it's not widely used. At least it seems like a valiant attempt to rid the world of ugly target-specific use of SIMD intrinsics :) Not sure where it is in the C++ standardization process. He talks about the need for shuffle operations in the intro, but AFAICT there's no actual proposal for an API.

@jfbastien
Member

jfbastien commented Apr 25, 2017

@PeterJensen Matthias' proposal is fairly advanced, and I expect to see it in C++20 or at least in a TS by then. That being said, the code as written is "portable" but has the ABI just under the surface. It isn't portable in the WebAssembly sense (it doesn't map to opcodes proposed here). That proposal also punts on shuffles.

BTW this is the latest revision: http://wg21.link/p0214
You'll also find this interesting: http://wg21.link/p0349

@stoklund
Contributor

@billbudge, yes definitely use the special-purpose NEON shuffle instructions when possible. They are likely to be faster. This goes for Intel too. For example, pshufd implements all of v32x4.swizzle and is going to be a better choice than pshufb when possible.

I agree that WebAssembly's abstracting away the physical registers is a problem. So is the diversity of pipeline designs for SIMD execution. The two Cortex-A micro-architectures I looked at above have 3-cycle latency minimum for even the most trivial integer SIMD operations and 5 cycles for floating-point. But at least in 64-bit mode you get 32 registers which helps with the pipelining. On the other hand, Intel's designs tend to have short latencies but in 32-bit mode you only have 8 XMM registers.

It is difficult to write SIMD code that performs well on those two design points simultaneously. I think our best bet is to depend on the out-of-order execution to overlap the execution of different iterations of the same loop. That way you can saturate the pipelines without using a ton of registers on something like software pipelining which is likely to cause spilling. This means that loop bodies should be short so many iterations fit in the instruction window, and loop-carried dependencies should be avoided or kept simple like in a reduction. This is easier said than done, of course.

OpenCL was a fairly successful attempt at a language with portable SIMD operations, but I don't know if it saw a lot of use for generating SIMD code running on a CPU. Most uses were for GPU code.

I am looking forward to seeing the results of your efforts. I don't think the outcome is a given either.

@AndrewScheidecker
Contributor

What about a NxN transpose that can't be expressed as a single 2-operand operator? It seems like it would be much easier to use a special NxN transpose operator, as well as to generate good code for one.

@billbudge
Author

Swizzle(mask, x) is equivalent to Shuffle(mask, x, x).

Do we need Swizzle if we have Shuffle?

@stoklund
Contributor

Technically, no. LLVM only has a single shufflevector instruction.

As we have discussed here, WebAssembly consumers are expected to take a closer look at the lane indices anyway, so asking them to recognize swizzles as shuffles where all the lane indices are in the first vector is not a big deal.
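That recognition check is trivial; a sketch for 16 byte lanes, assuming indices 0-15 address the first vector:

```cpp
#include <array>
#include <cstdint>

// A swizzle is just a shuffle whose lane indices all address the first
// input; a consumer can detect that and pick single-input lowerings
// such as pshufd or a single vtbl.
inline bool is_swizzle(const std::array<uint8_t, 16>& idx) {
  for (uint8_t v : idx)
    if (v >= 16) return false;  // this index refers to the second vector
  return true;
}
```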

@lemaitre

It might be an old issue, but I think it is still relevant, so here is what I think:
It is better to have a single generic instruction taking 2 vectors to be shuffled and 1 vector for the indices.
If the 3rd vector is constant, it is possible to detect the pattern and generate the best instruction sequence possible for the most common patterns, and fall back to some generic implementation for the uncommon and/or complex ones.

It might be interesting to have a variant with only 1 vector to shuffle just to simplify the generic case.

@lemaitre

For those who are interested, I created a merge request about this: #30

@dtig
Member

dtig commented Feb 2, 2021

Closing this issue as the current proposal has generic shuffles and swizzles included.
