This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Shuffle with immediate indices specification #30

Closed
wants to merge 11 commits into from

Conversation

lemaitre

@lemaitre lemaitre commented Apr 19, 2018

This is a PR for shuffling instructions with immediate indices.
It aims to add the instructions v16x8.shuffle2_imm, v32x4.shuffle2_imm, and v64x2.shuffle2_imm back into WebAssembly.

These instructions enable better and simpler pattern matching in the WASM->ASM virtual machine to ensure the best performance.
Indeed, while v8x16.shuffle2_imm can be used to emulate all the others, it is tedious to recognize which shuffling rules map to the proper instruction of the target platform.
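As a scalar reference for the intended semantics (a sketch only; the function shape and the out-of-range masking are my assumptions, not normative text), a two-input shuffle with immediate indices picks each output lane from the 8-lane concatenation of its two inputs:

```c
#include <stdint.h>

/* Scalar sketch of the v32x4.shuffle2_imm semantics proposed in this PR:
 * output lane i is lane idx[i] of the 8-lane concatenation a:b, where idx
 * is a compile-time immediate.  Masking to 3 bits is an assumption. */
void v32x4_shuffle2_imm(uint32_t out[4], const uint32_t a[4],
                        const uint32_t b[4], const uint8_t idx[4]) {
    for (int i = 0; i < 4; i++) {
        uint8_t j = idx[i] & 7;            /* 0..3 select from a, 4..7 from b */
        out[i] = (j < 4) ? a[j] : b[j - 4];
    }
}
```

For example, the immediate rule 0 4 1 5 interleaves the low lanes of the two inputs.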

EDITED: Old PR description below.

Hello everyone,

I created a small specification for shuffle and permute operations.
My intent with this merge request is mainly to have a discussion.

A few general questions:

  • How easy/difficult is it to implement?
  • How efficient can this be?
  • Are some operations/use-cases missing?
  • What are the caveats of such an approach?
  • ...

More specialized questions:

  • Is permute required to avoid input duplication?
  • Can the shuffling operations with immediates be avoided by relying on detecting constant index vectors?

Any comment on this is welcome.

@jfbastien
Member

In general the way we've approached SIMD instructions is to see what kind of code would benefit from these opcodes, and how. Can you document in this PR:

  • What opcodes you want to capture for x86 and ARM (and each one's various architecture revisions)
  • What kind of applications benefit
  • What performance gains they see (assuming a compiler couldn't just have figured out the right instructions already)
  • What .wasm binary size difference we get

@lemaitre
Author

lemaitre commented Apr 19, 2018

I first want to mention that the position of WASM is a bit different from a regular ISA:
it needs (a bit) more abstraction in order to allow efficient implementations on architectures with different instructions.
In that respect, I think WASM should provide high-level-ish operations, and generic shuffles fit completely into this scheme.

  • What opcodes you want to capture for x86 and ARM (and each one's various architecture revisions)

Basically every shuffling operation supported by an architecture:

SSE2:

  • movhlps: covered by v32x4.shuffle 6 7 2 3 or v64x2.shuffle 3 1
  • movlhps: covered by v32x4.shuffle 0 1 4 5 or v64x2.shuffle 0 2
  • movsd: covered by v64x2.shuffle 2 1
  • movss: covered by v32x4.shuffle 4 1 2 3
  • pshufd: covered by v32x4.permute A B C D with A B C D in [0, 1, 2, 3]
  • shufpd: covered by v64x2.shuffle A B with A in [0, 1] and B in [2, 3]
  • shufps: covered by v32x4.shuffle A B C D with A B in [0, 1, 2, 3] and C D in [4, 5, 6, 7]
  • pshufhw: covered by v16x8.permute 0 1 2 3 A B C D with A B C D in [4, 5, 6, 7]
  • pshuflw: covered by v16x8.permute A B C D 4 5 6 7 with A B C D in [0, 1, 2, 3]
  • punpckhbw: covered by v8x16.shuffle 8 24 9 25 10 26 11 27 12 28 13 29 14 30 15 31
  • punpcklbw: covered by v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
  • punpckhwd: covered by v16x8.shuffle 4 12 5 13 6 14 7 15
  • punpcklwd: covered by v16x8.shuffle 0 8 1 9 2 10 3 11
  • punpckhdq/unpckhps: covered by v32x4.shuffle 2 6 3 7
  • punpckldq/unpcklps: covered by v32x4.shuffle 0 4 1 5
  • punpckhqdq/unpckhpd: covered by v64x2.shuffle 1 3
  • punpcklqdq/unpcklpd: covered by v64x2.shuffle 0 2
  • _MM_TRANSPOSE4_PS: covered by (v64x2.shuffle 3 1) ×2 + (v64x2.shuffle 0 2) ×2 + (v32x4.shuffle 2 6 3 7) ×2 + (v32x4.shuffle 0 4 1 5) ×2 (this macro expands to 8 SSE instructions)

SSE3:

  • movddup: covered by v64x2.shuffle 0 0
  • movhdup: covered by v32x4.permute 1 1 3 3
  • movldup: covered by v32x4.permute 0 0 2 2

SSSE3:

  • pshufb: covered by v8x16.shuffleVar

AVX:

  • vpermilps: covered by v32x4.shuffleVar
  • vpermilpd: covered by v64x2.shuffleVar

AVX2:

  • vpbroadcastb: covered by v8x16.permute 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  • vpbroadcastw: covered by v16x8.permute 0 0 0 0 0 0 0 0
  • vpbroadcastd/vbroadcastss: covered by v32x4.permute 0 0 0 0
  • vpbroadcastq/vbroadcastsd: covered by v64x2.permute 0 0

AVX512:

  • vpermi2b: covered by v8x16.shuffleVar

Neon:

  • vswp: covered partially by v64x2.permute 1 0 (other use cases are just renaming)
  • vrev16.8: covered by v8x16.permute 1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14
  • vrev32.8: covered by v8x16.permute 3 2 1 0 7 6 5 4 11 10 9 8 15 14 13 12
  • vrev32.16: covered by v16x8.permute 1 0 3 2 5 4 7 6
  • vrev64.8: covered by v8x16.permute 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
  • vrev64.16: covered by v16x8.permute 3 2 1 0 7 6 5 4
  • vrev64.32: covered by v32x4.permute 1 0 3 2
  • vext: covered by v8x16.shuffle (or v16x8.shuffle, v32x4.shuffle, or v64x2.shuffle for some shifts)
  • vtrn.8: covered by v8x16.shuffle ×2
  • vtrn.16: covered by v16x8.shuffle ×2
  • vtrn.32: covered by v32x4.shuffle ×2
  • vzip.8: covered by v8x16.shuffle 8 24 9 25 10 26 11 27 12 28 13 29 14 30 15 31 + v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
  • vzip.16: covered by v16x8.shuffle 4 12 5 13 6 14 7 15 + v16x8.shuffle 0 8 1 9 2 10 3 11
  • vzip.32: covered by v32x4.shuffle 2 6 3 7 + v32x4.shuffle 0 4 1 5
  • vuzp.8: covered by v8x16.shuffle 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 + v8x16.shuffle 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
  • vuzp.16: covered by v16x8.shuffle 0 2 4 6 8 10 12 14 + v16x8.shuffle 1 3 5 7 9 11 13 15
  • vuzp.32: covered by v32x4.shuffle 0 2 4 6 + v32x4.shuffle 1 3 5 7
  • vtbl: covered by v8x16.permuteVar for a 1-vector table and by v8x16.shuffleVar for a 2-vector table (bigger tables need emulation with this design: multiple shuffleVars + selects)
  • vtbx: not supported (same as vtbl + select)
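The larger-table emulation mentioned for vtbl (multiple shuffleVars + selects) can be sketched in scalar form; a vector version would replace each loop body by one byte shuffle and one select/blend. Function and parameter names here are hypothetical:

```c
#include <stdint.h>

/* Scalar sketch of a 32-entry table lookup built from two 16-entry lookups:
 * look up each 16-byte half of the table, then select on bit 4 of the index.
 * A SIMD version is two shuffleVar ops plus a select. */
void tbl32_lookup(uint8_t out[16], const uint8_t table[32],
                  const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t lo = table[idx[i] & 15];        /* lookup in bytes 0..15  */
        uint8_t hi = table[16 + (idx[i] & 15)]; /* lookup in bytes 16..31 */
        out[i] = (idx[i] & 16) ? hi : lo;       /* select on index bit 4  */
    }
}
```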

If somebody wants it for Altivec/VSX, I can do that too.
I stopped here, as I think I have made my point for this question.

  • What kind of applications benefit

I will answer with low-level algorithms.
Shuffling can be used for the following (non-exhaustive list):

  • in-register transposition: for matrix transposition, or SoA <-> AoS conversion
  • reductions (as shown in the last paragraph)
  • moving window: aka shifts across vectors (for FIR filters)
  • interleaving data: for complex arithmetic
  • table lookup (to a certain extent)
  • endianness conversion
  • ...
  • What performance gains they see (assuming a compiler couldn't just have figured out the right instructions already)

These instructions cannot be emulated efficiently without some form of shuffling.
The only way to do it would be to fall back to scalar emulation.

That being said, if you compare with the shuffle instruction that was there before this MR (v8x16.shuffle), most of my instructions are redundant.
Only the shuffling instructions with a runtime shuffling index vector will be new.
Such an instruction will mainly be useful for table lookups in small tables.
In that case, without this MR, it also needs scalar emulation.

Now, I expect the redundancy to help the WASM->ASM translation with pattern recognition.
Better-recognized patterns would mean better performance. But this is only speculation at this stage.
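To illustrate the kind of pattern recognition meant here, a WASM->ASM compiler could compare an immediate v32x4 shuffle rule against a few known x86 patterns. This is a sketch; the helper and the string return values are mine, not part of any real compiler:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of shuffle-rule pattern matching in a WASM->ASM compiler: map a
 * v32x4 immediate index rule to an x86 instruction when one fits. */
static bool rule_is(const uint8_t x[4], uint8_t a, uint8_t b,
                    uint8_t c, uint8_t d) {
    return x[0] == a && x[1] == b && x[2] == c && x[3] == d;
}

const char *match_v32x4_shuffle(const uint8_t idx[4]) {
    if (rule_is(idx, 0, 4, 1, 5)) return "punpckldq"; /* interleave low  */
    if (rule_is(idx, 2, 6, 3, 7)) return "punpckhdq"; /* interleave high */
    if (idx[0] < 4 && idx[1] < 4 && idx[2] >= 4 && idx[3] >= 4)
        return "shufps";              /* low lanes from a, high from b */
    return "generic";                 /* fall back, e.g. a pshufb pair */
}
```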

  • What .wasm binary size difference we get

Compared to the situation without any shuffling operation, it will be a huge gain because we don't need scalar emulation for these.

Now, the comparison with only v8x16.shuffle:

Instruction        nb inputs   immediate size   binary coding size
v8x16.permute      1           64 bits          10 B
v16x8.permute      1           24 bits           5 B
v32x4.permute      1            8 bits           3 B
v64x2.permute      1            2 bits           3 B
v8x16.shuffle      2           128 bits         18 B
v16x8.shuffle      2           48 bits           8 B
v32x4.shuffle      2           16 bits           4 B
v64x2.shuffle      2            4 bits           3 B
v8x16.permuteVar   2            0 bits           2 B
v16x8.permuteVar   2            0 bits           2 B
v32x4.permuteVar   2            0 bits           2 B
v64x2.permuteVar   2            0 bits           2 B
v8x16.shuffleVar   3            0 bits           2 B
v16x8.shuffleVar   3            0 bits           2 B
v32x4.shuffleVar   3            0 bits           2 B
v64x2.shuffleVar   3            0 bits           2 B

This table makes it pretty clear that v8x16.shuffle has a huge coding overhead that the other instructions mitigate.
Note that if you need a permute but only have a shuffle instruction, you first need to duplicate the input on the stack (2 more bytes?).
The variable variants require first computing the vector of indices, which would take more space in the binary than the constant version with an immediate.

However, I have no idea how big the difference will be on a complete program.

I hope this allows some discussion before a deeper analysis.

@gnzlbg
Contributor

gnzlbg commented Aug 7, 2018

Note that if you need a permute but only have a shuffle instruction, you first need to duplicate the input on the stack (2 more bytes?).

On x86 you can call pshufd xmm0, xmm0, imm with the same vector register twice, so I don't think one needs to duplicate the vector or pass it through the stack. I don't know WASM, but this seems like a pretty basic feature of any ISA, and it would surprise me if WASM didn't support it.

Can the shuffling operations with immediates be avoided by relying on detecting constant index vectors?

WASM to machine code compilers are not necessarily optimizing compilers. So I don't think this can be avoided because that would introduce a pretty big performance cliff when the immediate mode operand is not correctly identified as such (more on this below).

Only the shuffling instructions with a runtime shuffling index vector will be new.

AFAIK x86/x86_64+SSSE3, arm32/64+neon, ... only support shuffling bytes with run-time indices. The moment you want to shuffle v16x8, v32x4, v64x2, ... you are out-of-luck.

So why add these shuffle instructions without immediate-mode arguments for all vector types when no ISA really supports them?

Also, what is the best code that a machine code generator could generate from such shuffles for vectors in general? Well if the vector is a v8x16, and we have an immediate mode argument, or an optimizing machine code generator, then the best we can get is one "shuffle bytes" / "table lookup" instruction on x86 and arm.

Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register.

That's pretty bad from the point of view of code size and performance if the "user" (compiler targeting WASM, user, ...) provides run-time indices instead of an immediate mode operand for whatever reason (optimizations failed, bug, user didn't know...).


All in all, I think that:

  • Adding an instruction to shuffle bytes (v8x16) with run-time indices could be proposed, since this is supported by common ISAs (SSSE3 and neon at least, which means this is typically available at least on most browsers compiling WASM).

  • Adding a permute instruction is not necessary:

    • with immediate mode arguments a machine code generator without optimizations can lower shuffle x x ids to the appropriate permute instructions if the same vector is used twice, or even if different vectors are used but the lane indices only index into one of them.
    • I am unconvinced that any potential code size reduction will be significant enough to be observable in practice. In my experience, most programs don't use shuffles, and those who do, use it in a tiny fraction of the code, such that saving a couple of bytes here and there won't really have a real impact in real programs. There are obviously counter examples: e.g. a library of matrix multiplication algorithms might use shuffles a lot, but even there, I am skeptical about the code size saving that a new instruction for permutes could deliver. I guess I just value keeping the ISA simple more.
  • Adding shuffles and permutes with run-time indices lacks a bigger motivation. AFAICT no widely used architecture really supports them yet. LLVM might be adding support for this to be able to target RISC-V vector extensions, but this is all too much in flux for my taste.

@lemaitre
Author

lemaitre commented Aug 7, 2018

On x86 you can call pshufd xmm0, xmm0, imm with the same vector register twice, so I don't think one needs to duplicate the vector or pass it through the stack. I don't know WASM, but this seems like a pretty basic feature of any ISA, and it would surprise me if WASM didn't support it.

WASM is a stack machine: you cannot address individual registers.
An instruction pops as many operands from the stack as it needs.

So an instruction taking 2 arguments pops 2 operands from the stack.
If you want the 2 arguments to be the same, you first need to duplicate the top of the stack.
But this should probably not generate any duplication instruction on the actual machine.

Can the shuffling operations with immediates be avoided by relying on detecting constant index vectors?

WASM to machine code compilers are not necessarily optimizing compilers. So I don't think this can be avoided because that would introduce a pretty big performance cliff when the immediate mode operand is not correctly identified as such (more on this below).

That was also my feeling, but I think the question needed to be asked.

AFAIK x86/x86_64+SSSE3, arm32/64+neon, ... only support shuffling bytes with run-time indices. The moment you want to shuffle v16x8, v32x4, v64x2, ... you are out-of-luck.

So why add shuffles for all vector types with run-time indices when these are not supported?

That's not true. While I agree implementations lack some of them, support for them is getting better and better.
AVX supports shuffling 1x v32x4 and 1x v64x2. You can also emulate 1x v16x8 with a gather. And you can emulate 2x vNxM from 1x vNxM.
AVX512 supports shuffling 2x v32x4 and 2x v64x2.

Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16.

My point is: why penalize everybody by having a single one when some architectures support some of them and would benefit from them?

Also, what is the best code that a machine code generator could generate from such shuffles for vectors in general? Well if the vector is a v8x16, then just one instruction on x86 and arm.

Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register.

You don't have to go back to scalar to emulate all the others.
And you can actually be pretty smart on how you do the emulation if you have some versions at your disposal.

The idea would be: if a user needs 1x v8x16, fine, they can use it.
But if their code needs 2x v32x4, they should also be able to use it, and it will either map to an instruction doing just that, or get the best emulation possible for the current architecture.

But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 for WASM.
This would lead to worse performance on all platforms except the ones supporting only this instruction (not so common anymore).

That's pretty bad from the point of view of code size and performance if the "user" (compiler targeting WASM, user, ...) provides run-time indices instead of an immediate mode operand for whatever reason (optimizations failed, bug, user didn't know...).

Or the code they are writing cannot be written with immediates: table lookups are a good example.

All in all, I think that:

  • Adding an instruction to shuffle bytes (v8x16) with run-time indices could be proposed, since this is supported by common ISAs (x86 SSSE3 and newer, and arm neon at least).

  • Adding a permute instruction is not necessary

Well, the overhead of 2 bytes to duplicate the top of the stack to emulate a permute with a shuffle is probably fine.
So I also think it is fine to have only 2-input shuffling.
I think it was important to ask the question, though.

  • Adding shuffles and permutes with run-time indices lacks a bigger motivation. AFAICT no widely used architecture really supports them yet. LLVM might be adding support for this to be able to target RISC-V vector extensions, but this is all too much in flux for my taste.

Well, AVX512 supports almost all variants with run-time indices. Only the 16-bit elements are not supported.


With all that being said, I think it is crucial to have the extra element size variants:

  • It makes the ISA nice and clean
  • With immediate indices, it allows a nice code size reduction when shuffling is used heavily (more common than you might think). This also simplifies the pattern recognition of the shuffling rules for longer elements and might make the JIT compiler faster (no numbers on that...)
  • With runtime indices, it allows adapting the code generation to better fit the underlying architecture

@gnzlbg
Contributor

gnzlbg commented Aug 7, 2018

Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16.

[...]

You don't have to go back to scalar to emulate all the others.
And you can actually be pretty smart on how you do the emulation if you have some versions at your disposal.

Could you mention how this is done? More specifically, how can you do this without knowing the values of the indices at compile-time?

My point is: why penalize everybody by having a single one when some architectures support some of them and would benefit from them?

Currently, WASM has no vector ISA, only a proposed one. This vector ISA is conservative. It only supports 128-bit wide vectors, it doesn't provide horizontal reductions, etc. Why? Because WASM is shipped over the internet to unknown machines, where it has to be compiled and run blazingly fast and reliably so that a webpage can render instantaneously. The most common hardware where WASM runs is x86 desktops, and the billions of arm Android, iOS, ... devices.

Shuffling v8x16s with run-time indices is supported on SSSE3, arm32+neon, arm64+neon, powerpc+altivec... which is pretty much all hardware in which WASM currently runs, and I don't think that adding these would be a very controversial addition to the spec with the right motivation (in which domains and for what applications are they important, etc.).

On the other hand, when you asked the question we were talking explicitly about shuffling with run-time indices for all vectors but v8x16. You state that WASM should support these intrinsics because "some architectures support some of them", where "some" actually means that some of the intrinsics that you propose adding are not supported by any architecture while the rest is supported only by x86+AVX/AVX2/AVX-512.

That's a tiny fraction of the devices on which WASM runs; everywhere else these would need to be emulated, potentially failing to deliver reliable performance, complicating the code generation backend, and making WASM potentially slower to compile, etc.

You mention that some of these are supported in AVX-512 as if this were an argument to add them, but AVX-512 is a very controversial ISA often described as "horrible" (which might mean that the instructions it offers won't be offered by any other ISA), which is supported by basically zero hardware currently used to browse the internet, and even if we were lucky enough to have WASM running on an AVX-512 machine, there is currently no consensus in the technical community about whether actually using AVX-512 at run-time is worth the trouble.

When you mention that WASM should have this or that instruction because it is available in AVX-512, I actually think that WASM shouldn't have it because it isn't available on AVX, SSE4.2, NEON, ALTIVEC, etc. In a nutshell, if an instruction is only available on AVX-512, I see that as a pretty strong argument against adding it to the ISA.


But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 for WASM.

I don't see the issue with this. People don't often write WASM by hand: they write C, Rust, or some other higher-level language, and they then run an optimizing compiler like LLVM that generates WASM, and which already has a framework for lowering vector shuffles to hardware.

I'd rather have these optimizing compilers do these optimizations than force the WASM compiler to become an optimizing compiler.

Well if the vector is a v8x16, and we have an immediate mode argument, or an optimizing machine code generator, then the best we can get is one "shuffle bytes" / "table lookup" instruction on x86 and arm.

Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register.
That's pretty bad from the point of view of code size and performance if the "user" (compiler targeting WASM, user, ...) provides run-time indices instead of an immediate mode operand for whatever reason (optimizations failed, bug, user didn't know...).

Or the code they are writing cannot be written with immediates: table lookups are a good example.

A table lookup is just a shuffle with run-time indices. Note also, that I was talking about shuffling non-v8x16 vectors with run-time indices. The "user didn't know" refers to the precise case that you mention: a user wants to explicitly do something that cannot be written with immediates, like a table lookup, yet because they are shuffling non-v8x16 vectors they get potentially horrible cross-platform performance.

Adding an instruction to WASM doesn't fix this, just shifts the problem of generating efficient machine code to the code generator, which might not be an optimizing compiler. From the user POV, inspecting the assembly generated won't help them, because they will just see a single WASM instruction, which looks fast. I'd just rather have the optimizing compiler deal with this. Adding these to WASM appears to me to be unnecessary trouble, for little win.


With all that being said, I think it is crucial to have the extra element size variants:

FWIW, I have nothing against the extra element size variants, nor about adding shuffles with run-time indices for v8x16. I don't think these are controversial, and wish they would be proposed in a single, non-controversial PR, separated from everything else.

I have doubts about whether the value that the permutes add is worth it. I agree that they do add value (reduced code size), but this comes at the cost of increasing the ISA. Performance-wise, these should generate the same machine code as the shuffles on all targets. Honestly, I'd just remove them. Once the uncontroversial parts are merged and implemented, it will be easier to assess the code size concerns. If these turn out to be significant, adding them might not be controversial, and you will have proof that adding them solves real code size problems, rather than hypothetical ones.

I have serious doubts about shuffling / permutes with run-time indices for all vector types that are not v8x16 because cross-platform support for these is too limited. If you really want them it might make sense to split them in a different PR, so that the uncontroversial parts can make progress independently of what happens with these. They can always be added in a backwards compatible way in the version 1.x of the ISA, so this shouldn't really be a big deal.

@lemaitre
Author

lemaitre commented Aug 7, 2018

Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16.

Could you mention how this is done? More specifically, how can you do this without knowing the values of the indices at compile-time?

Here is one example to implement 1x v32x4 in SSSE3:

#include <tmmintrin.h>  /* SSSE3 */

__m128i _mm_permutevar_epi32(__m128i a, __m128i b) {
    /* Byte offsets 0..3 within each 32-bit lane. */
    __m128i off = _mm_set_epi8(3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0);
    /* Rule to broadcast the low byte of each 32-bit index to its whole lane. */
    __m128i shuffle = _mm_set_epi8(12, 12, 12, 12, 8, 8, 8, 8, 4, 4, 4, 4, 0, 0, 0, 0);
    __m128i v = _mm_shuffle_epi8(b, shuffle);
    /* Scale the lane indices to byte indices (*4) and add the byte offsets. */
    v = _mm_add_epi8(_mm_slli_epi32(v, 2), off);
    /* One byte shuffle performs the actual 32-bit permutation. */
    return _mm_shuffle_epi8(a, v);
}

This boils down to only 4 instructions.

On the other hand, when you asked the question we were talking explicitly about shuffling with run-time indices for all vectors but v8x16. You state that WASM should support these intrinsics because "some architectures support some of them", where "some" actually means that some of the intrinsics that you propose adding are not supported by any architecture while the rest is supported only by x86+AVX/AVX2/AVX-512.

That's a tiny fraction of the devices on which WASM runs; everywhere else these would need to be emulated, potentially failing to deliver reliable performance, complicating the code generation backend, and making WASM potentially slower to compile, etc.

No: AVX machines represent a big proportion of desktop machines and support 1x v32x4 and 1x v64x2.
AVX has been there since 2011 on x86 machines (Sandy Bridge for Intel and Bulldozer for AMD).
Those are not a minority. In fact, I'm quite confident they are now the majority of all x86 user machines (not taking company servers into account).

Moreover, AVX512 will get into consumer machines (next year?), so it is a target of interest.

Now, my point is: a big part of the machines will be able to take advantage of the extra information carried by the extra sizes.
Let's not cripple them just because some architectures don't support those natively.

You mention that some of these are supported in AVX-512 as if this were an argument to add them, but AVX-512 is a very controversial ISA often described as "horrible" (which might mean that the instructions it offers won't be offered by any other ISA), which is supported by basically zero hardware currently used to browse the internet, and even if we were lucky enough to have WASM running on an AVX-512 machine, there is currently no consensus in the technical community about whether actually using AVX-512 at run-time is worth the trouble.

While AVX512 is far from perfect, it is the best SIMD ISA from Intel. The only real controversy about AVX512 is: is 512-bit-wide SIMD worth it?

But that does not concern us here, as we are only dealing with short SIMD (128 bits).
And AVX512 still has many instructions for 128-bit-wide SIMD that simplify some things.

When you mention that WASM should have this or that instruction because it is available in AVX-512, I actually think that WASM shouldn't have it because it isn't available on AVX, SSE4.2, NEON, ALTIVEC, etc. In a nutshell, if an instruction is only available on AVX-512, I see that as a pretty strong argument against adding it to the ISA.

It is not some random instruction whose semantics are ultra specific or unclear.
It just takes something you agreed was useful and extends it to be complete.

I would add: this is the kind of opinion that led to SSE/AVX, where some instructions are missing for some types but do exist for other types. SSE and AVX (and AVX512 for that matter) have many inconsistencies and holes in their ISA features.
Those inconsistencies make these ISAs painful to use.

That's what I want to avoid here.

And as I said, providing more generic instructions will allow a better translation on some common architectures because they carry more information.

But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 for WASM.

I don't see the issue with this. People don't often write WASM by hand: they write C, Rust, or some other higher-level language, and they then run an optimizing compiler like LLVM that generates WASM, and which already has a framework for lowering vector shuffles to hardware.

I'd rather have these optimizing compilers do these optimizations than force the WASM compiler to become an optimizing compiler.

But such a compiler cannot know which architecture it will run on, so it is better to carry more information that can be used by the WASM compiler.
This last step does not rely on heavy compilation: just translate the shuffle instruction into a predefined sequence doing the same action on the current platform.
It does not need to perform more optimizations with this design.

Adding an instruction to WASM doesn't fix this, just shifts the problem of generating efficient machine code to the code generator, which might not be an optimizing compiler. From the user POV, inspecting the assembly generated won't help them, because they will just see a single WASM instruction, which looks fast. I'd just rather have the optimizing compiler deal with this. Adding these to WASM appears to me to be unnecessary trouble, for little win.

The WASM->ASM generation will not be heavier for these instructions than for others, as the optimal way to emulate them will be precomputed.

With all that being said, I think it is crucial to have the extra element size variants:

FWIW, I have nothing against the extra element size variants, nor about adding shuffles with run-time indices for v8x16. I don't think these are controversial, and wish they would be proposed in a single, non-controversial PR, separated from everything else.

This could be done, indeed.
I'll do it tomorrow.

I have doubts about whether the value that the permutes add is worth it. I agree that they do add value (reduced code size), but this comes at the cost of increasing the ISA. Performance-wise, these should generate the same machine code as the shuffles on all targets. Honestly, I'd just remove them. Once the uncontroversial parts are merged and implemented, it will be easier to assess the code size concerns. If these turn out to be significant, adding them might not be controversial, and you will have proof that adding them solves real code size problems, rather than hypothetical ones.

I also have my doubts on this subject, hence my question.
You seem to think they're not worth it currently, and I'm completely fine with that.

@lemaitre
Author

lemaitre commented Aug 8, 2018

Ok, I removed the single-input permute instructions, and the runtime-index shuffles except v8x16.shuffleVar.

Now I'm wondering. Many names have been used to designate these operations:
shuffle, permute, swizzle, blend, table lookup.
Is shuffle really the best name? Is shuffleVar a good name for runtime indices?

I have no strong opinion on the subject.

@gnzlbg
Contributor

gnzlbg commented Aug 8, 2018

Is shuffleVar a good name for runtime indices?

Good question. I like it. This is why.

shuffle, permute, swizzle, blend, table lookup.

I am not a native English speaker, but all of these sound like pretty much synonyms to me.

Ideally, we could just use shuffle and add a second encoding of the instruction that accepts a non-immediate-mode operand. That would make the ISA cleaner. But I don't think this is a good idea when the cost of an instruction can vary a lot between immediate and non-immediate operands. Just give it a different name and call it a day. (E.g. in the TBM instruction set, AMD used the i suffix in bextri to denote the bit-extract variant that accepts immediate-mode arguments.)

So we could call the dynamic shuffle shuffle, and the shuffle with immediates shufflei. Whether we use shuffle and shuffleVar instead, I don't care much either way. I just want these operations, with minimal performance footguns.

Calling the new instruction shuffleVar initially sounds more than fair. I might not realize at first what Var stands for, but that's something I would discover when googling for the instruction.

@gnzlbg
Contributor

gnzlbg commented Dec 13, 2018

FWIW, Rust's packed_simd crate ended up implementing only the 1-argument variant of these.

The name that got consensus there after some bikeshedding was shuffle_dyn1 (_dyn for dynamic; variable had confusing connotations in programming), reserving the possibility of introducing a two-argument shuffle_dyn in the future. That is, the set that Rust ended up implementing for 128-bit wide vectors is:

v8x16.shuffle_dyn1(a: v128, indices: v128) -> v128
v16x8.shuffle_dyn1(a: v128, indices: v128) -> v128
v32x4.shuffle_dyn1(a: v128, indices: v128) -> v128
v64x2.shuffle_dyn1(a: v128, indices: v128) -> v128

Rust provides efficient implementations of these for arm32+v7+neon, arm64+asimd, x86/x86_64 + SSSE3, x86/x86_64 + AVX. Efficient implementations for powerpc should also be possible. I don't know about MIPS and RISCV - we haven't really started those yet.
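For reference, the semantics of such a one-argument dynamic shuffle can be sketched as a scalar loop. The names and the out-of-range behavior here are assumptions for illustration (zeroing lanes whose index has the high bit set, borrowed from x86's pshufb); the actual proposal may specify this differently.

```c
#include <stdint.h>

/* Scalar reference sketch of a one-input dynamic byte shuffle:
   each output lane selects a lane of `a` by the low 4 bits of the
   corresponding index; indices with the high bit set yield 0
   (pshufb-style zeroing - an assumption, not the proposal's text). */
static void v8x16_shuffle_dyn1(uint8_t out[16], const uint8_t a[16],
                               const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : a[idx[i] & 0x0F];
}
```

The wider-lane variants (v16x8, v32x4, v64x2) follow the same pattern with fewer, wider lanes.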

@lemaitre
Copy link
Author

I think it is much easier to write a one-input shuffle using a two-input shuffle than the opposite.
So I think it is better to specify two-input shuffles first.

And for the WASM -> ASM lowering, the pattern of implementing a one-input shuffle with a two-input shuffle is also easy to detect, so the appropriate instruction can be generated.
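The point above can be seen in a scalar sketch (names are illustrative, not part of the proposal): a one-input shuffle is just a two-input shuffle with the same vector passed twice, which is exactly the pattern a code generator would look for.

```c
#include <stdint.h>

/* Illustrative two-input shuffle with immediate indices:
   indices 0..3 select from `a`, 4..7 from `b`. */
static void v32x4_shuffle2(uint32_t out[4], const uint32_t a[4],
                           const uint32_t b[4], const int idx[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (idx[i] < 4) ? a[idx[i]] : b[idx[i] - 4];
}

/* One-input shuffle expressed via the two-input one:
   duplicate the input so every index stays in range. */
static void v32x4_shuffle1(uint32_t out[4], const uint32_t a[4],
                           const int idx[4]) {
    v32x4_shuffle2(out, a, a, idx);
}
```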

@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 13, 2018

I think it is much easier to write one input shuffle using two input shuffle than the opposite.

That is possibly true. Do you have, or can you point us to, an implementation of the two-input dynamic shuffle for the relevant vector types on the most common modern platforms (arm32, arm64, x86_64 SSE, and x86_64 AVX would suffice for me, but ppc64le would also be nice)?

I'd like to check how hard those are to implement, and also the performance of implementing a one-input shuffle on top of the two-input ones, instead of having tailored algorithms for one-input shuffles directly.

My experience with the arm and x86 implementations gives me the feeling that the machine code generated for the two-input versions would need to be significantly different from the single-input one.

@lemaitre
Copy link
Author

Ah sorry, I missed that you suggested one-input dynamic shuffles.
Then, it's quite the opposite... (but it's still doable)

@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 13, 2018

Ah yes, I was talking about one input dynamic shuffles! I guess we are on the same page!

I agree that implementing the two-input version is definitely possible; we haven't done it yet because there has been little demand, and its mapping to hardware and its performance are not as straightforward to reason about as for the single-input version (multi-instruction sequences vs. often just a single instruction).

Do you think it could make sense to also offer a single input dynamic shuffle instruction ? If so, maybe it might be worth it to start by adding the single instruction dynamic shuffle version, while keeping the door open for a two input dynamic shuffle version in the future. The single instruction version maps in a straightforward way to most hardware, which makes it uncontroversial, and delivers instant value.
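For illustration, here is a scalar sketch of the multi-instruction emulation alluded to above: a two-input dynamic byte shuffle built from two one-input shuffles and an OR-select. The pshufb-style zeroing of indices with the high bit set is an assumption; the helper names are illustrative.

```c
#include <stdint.h>

/* One-input dynamic shuffle with pshufb-style zeroing (assumption):
   indices with the high bit set produce 0. */
static void shuffle1_dyn(uint8_t out[16], const uint8_t a[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : a[idx[i] & 0x0F];
}

/* Two-input dynamic shuffle emulated with two one-input shuffles:
   indices 0..15 select from `a`, 16..31 from `b`. Lanes meant for the
   other vector are forced to zero, then the halves are OR'd together. */
static void shuffle2_dyn(uint8_t out[16], const uint8_t a[16],
                         const uint8_t b[16], const uint8_t idx[16]) {
    uint8_t lo[16], hi[16], ia[16], ib[16];
    for (int i = 0; i < 16; i++) {
        ia[i] = (idx[i] < 16) ? idx[i] : 0x80;                   /* kill b-lanes */
        ib[i] = (idx[i] >= 16 && idx[i] < 32) ? (uint8_t)(idx[i] - 16) : 0x80;
    }
    shuffle1_dyn(lo, a, ia);
    shuffle1_dyn(hi, b, ib);
    for (int i = 0; i < 16; i++)
        out[i] = lo[i] | hi[i];                                  /* select by OR */
}
```

On x86 this is roughly two pshufb plus a por; a native two-input instruction (e.g. NEON vtbl2) would do it in one.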

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 2, 2019

@lemaitre could you resolve the conflicts ?

@dtig would it be possible to review this? as proposed this is (1) super useful, and (2) easily implementable on most architectures.

@lemaitre
Copy link
Author

lemaitre commented Mar 2, 2019

@lemaitre could you resolve the conflicts ?

I've just done the merge from master.
Also, I changed the dynamic indices version from 2 inputs to 1 input as suggested previously.

Tell me if everything is correct now.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 2, 2019

So this LGTM. Optionally it could also modify the binary encoding to include these, but maybe it is better to wait on @dtig review.

@tlively
Copy link
Member

tlively commented Mar 6, 2019

I think the discussion here is missing justification for why these instructions should be added. We need to know specific classes of applications that would benefit and we need to have tentative performance and/or code size numbers that would show that including these instructions would be worth the extra implementation effort.

In particular, the variable-index permute seems very complex and non-portable based on discussions above. WebAssembly instructions are meant to map very simply to the underlying native instructions so that baseline compilers in engines do not need to do much work, so we actively do not want abstractions in the instruction set. What specific use cases do variable-index permutes address? Are there more portable formulations of it that would still be useful?

@zeux
Copy link
Contributor

zeux commented Mar 6, 2019

@tlively As one example, see comment #24 (comment). The algorithm implemented there needs v8x16.permute_dyn and isn't practical without it.

@dtig
Copy link
Member

dtig commented Mar 6, 2019

Would it be possible to get numbers that are specific to v8x16.permute_dyn? I guess what I'm looking for here is concrete data to justify the complexity - this could even be just a native C++ comparison, i.e. the difference in performance on native between pshufb and an emulated dynamic shuffle. The coarse-grained with/without-SIMD data, though helpful, does not justify the addition of this particular set of permutations.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 6, 2019

There are three different things being proposed in this issue.

Encoding of shuffle immediates

@lemaitre has already hinted that it makes sense to them to split that discussion, and @dtig has also commented that they would find that useful, so let's assume that this will happen.

More shuffle instructions with immediate indices

These are useful, e.g., for adding all f32s in a f32x4 vector, and widely supported on all architectures (x64, arm, ppc, etc.), but they are not strictly necessary because one can just use i8x16.shuffle instead.

The question is IMO whether they are worth it. The two pros I can think of here are that they reduce binary size, and that they might simplify the WASM machine-code generator: instead of having to recognize that a particular i8x16.shuffle index sequence actually expresses an i32x4 shuffle, it can just lower it straightforwardly to optimal code.

I have no idea whether this is the case, but maybe @sunfishcode can chime in and comment about how easy it is for Cranelift to generate optimal code for i32x4 shuffles from i8x16 indices or similar? EDIT: and maybe also comment about what they think about these other shuffle variants.
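As a concrete instance of the f32x4 reduction mentioned above, here is a scalar sketch of the two shuffle+add steps that a typed shuffle with immediate indices expresses directly (the stand-in type and helpers are illustrative, not proposal names):

```c
/* Scalar stand-in for a 4-lane f32 vector. */
typedef struct { float v[4]; } f32x4;

static f32x4 f32x4_shuffle(f32x4 a, const int idx[4]) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[idx[i]];
    return r;
}

static f32x4 f32x4_add(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* Horizontal sum via two shuffle+add steps; the index arrays are the
   immediates a v32x4-level shuffle would carry. */
static float f32x4_hsum(f32x4 x) {
    static const int swap_pairs[4] = {2, 3, 0, 1};
    static const int swap_lanes[4] = {1, 0, 3, 2};
    f32x4 t = f32x4_add(x, f32x4_shuffle(x, swap_pairs)); /* {a0+a2, a1+a3, ...} */
    t = f32x4_add(t, f32x4_shuffle(t, swap_lanes));       /* every lane = total */
    return t.v[0];
}
```

Expressed with i8x16 indices, the code generator would first have to prove that each group of four byte indices forms a contiguous aligned word before emitting the single-instruction f32x4 form.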

Dynamic shuffles

The i8x16.permute_dyn is supported on all mainstream architectures (x64 SSE, NEON, PPC), and it cannot be emulated easily with static shuffles.

The question here seems to be whether this is useful / whether it allows to write faster programs.

The classical Fannkuch Redux benchmark (benchmarks game and paper) shows that the two fastest implementations in C and C++ (1.5x faster than the 3rd place) actually use this intrinsic explicitly; it is called _mm_shuffle_epi8 and lowers to pshufb. On ARM it would just lower to vtbl (which stands for table lookup, another useful thing this can be applied to).

In Rust's portable packed SIMD we call this shuffle1_dyn (shuffle, 1-argument, dynamic indices), and the Fannkuch benchmark became 1.3-1.5x faster with it on x86 (https://github.com/rust-lang-nursery/packed_simd/blob/master/examples/fannkuch_redux/src/simd.rs#L59). @hsivonen might be able to run the benchmark on arm64 and report the results for the scalar and vectorized implementations, e.g., on Android. Note, however, that arm64 has even better support for look-up tables than arm32, and packed_simd exploits this.

EDIT: so IMO i8x16.permute_dyn is well supported across the board, useful, and worth adding. (This PR originally proposed adding more permute_dyn variants, but those were not very well supported so they were dropped).

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 7, 2019

Maybe it might be worth it to split this into those three issues, so that we can discuss and resolve them independently of each other, and so that reaching consensus on one of the issues is not blocked by the other two.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 7, 2019

Also, I just recalled another application of i8x16.permute_dyn: SIMD UTF-8 validation (e.g. see https://github.com/lemire/fastvalidate-utf-8/blob/master/include/simdutf8check.h#L35 - there is a blog post here https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/).

Basically, this instruction is useful whenever you need to perform a table lookup with run-time indices. This comes up often when encoding/decoding anything even mildly complex (UTF-8, DNA, etc.).
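A classic instance of this table-lookup idiom (an illustrative sketch, not taken from the linked UTF-8 validator) is per-byte popcount via two nibble lookups into a 16-entry table, where the dynamic byte shuffle plays the role of 16 parallel table fetches:

```c
#include <stdint.h>

/* Scalar stand-in for a one-input dynamic byte shuffle used as a
   16-entry parallel table lookup (indices are already in 0..15 here). */
static void shuffle1_dyn(uint8_t out[16], const uint8_t table[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = table[idx[i] & 0x0F];
}

/* Per-byte popcount: look up the bit count of each nibble, then add. */
static void popcount_bytes(uint8_t out[16], const uint8_t in[16]) {
    static const uint8_t nibble_pop[16] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    uint8_t lo[16], hi[16], pl[16], ph[16];
    for (int i = 0; i < 16; i++) {
        lo[i] = in[i] & 0x0F;
        hi[i] = in[i] >> 4;
    }
    shuffle1_dyn(pl, nibble_pop, lo);  /* popcount of low nibbles */
    shuffle1_dyn(ph, nibble_pop, hi);  /* popcount of high nibbles */
    for (int i = 0; i < 16; i++)
        out[i] = pl[i] + ph[i];
}
```

With pshufb or vtbl, each shuffle1_dyn call here is a single instruction operating on all 16 bytes at once.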

@zeux
Copy link
Contributor

zeux commented Mar 7, 2019

permute_dyn is critical for any byte-wise processing where the structure isn't fixed-width - this is how my decoder uses it as well. I'll try to measure the performance of the C++ code with emulation using 16 scalar table fetches.

@penzn
Copy link
Contributor

penzn commented Mar 7, 2019

@gnzlbg, thank you for summarizing (and thank you @lemaitre for bringing this up). Those are significant; we should try to reach some conclusion on all three. Does either one of you want to post this as three separate proposals (issues)? I can do it myself tomorrow, but I think you can probably explain it better than I do.

@zeux
Copy link
Contributor

zeux commented Mar 7, 2019

Added more elaborate benchmark numbers targeting specifically the lack of dynamic permute: #24 (comment)

@penzn
Copy link
Contributor

penzn commented Mar 7, 2019

Filed #68, #69, and finally #70. To me this feels like something we can move post-MVP when we have more examples of WASM SIMD running in the wild.

@lemaitre
Copy link
Author

As the community seems to have reached a consensus on the other PRs/issues, I have updated my PR to take all the changes into account.

@lemaitre lemaitre changed the title Shuffle and permute specification Shuffle with immediate indices specification Mar 31, 2019
@baryluk
Copy link

baryluk commented Oct 31, 2019

This PR would also help with targeting ARM SVE: shuffles could map onto one of the permute-with-immediate instructions (DUP, EXT, INSR, REV, REVB, REVH, REVW, SUNPKHI, SUNPKLO, TRN1, TRN2, UUNPKHI, UUNPKLO, UZP1, UZP2, ZIP1, ZIP2) or the copy/broadcast instructions (CPY, DUP, FCPY, FDUP, SEL), with the rest emulated using a permute with an extra vector or predicate register (COMPACT, SPLICE, or TBL).

"ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A"
https://developer.arm.com//docs/ddi0584/ad/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
https://developer.arm.com/-/media/developer/products/architecture/DDI0584A_d_SVE.zip

https://static.docs.arm.com/ddi0584/ae/DDI0584A_e_SVE_supp_armv8A.pdf

https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_00_en.pdf

https://static.docs.arm.com/101726/0100/porting_and_optimizing_hpc_applications_for_arm_sve_101726_0100_00_en.pdf

See these references for details.

@tlively
Copy link
Member

tlively commented Feb 2, 2021

Closing since we have consensus on our current strategy for shuffles.

@tlively tlively closed this Feb 2, 2021