This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Shuffle with immediate indices specification #30

Closed
wants to merge 11 commits into from

Conversation

lemaitre

@lemaitre lemaitre commented Apr 19, 2018

This is a PR for shuffling instructions with immediate indices.
It aims to add the instructions v16x8.shuffle2_imm, v32x4.shuffle2_imm, and v64x2.shuffle2_imm back into WebAssembly.

These instructions enable better and simpler pattern matching in the WASM->ASM virtual machine to ensure the best performance.
Indeed, while v8x16.shuffle2_imm can be used to emulate all the others, it is tedious to recognize which shuffling rules map to the proper instruction of the target platform.
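As a scalar reference for the intended semantics (a sketch only; the function shape and the out-of-range masking are my assumptions, not normative text), a two-input shuffle with immediate indices picks each output lane from the 8-lane concatenation of its two inputs:

```c
#include <stdint.h>

/* Scalar sketch of the v32x4.shuffle2_imm semantics proposed in this PR:
 * output lane i is lane idx[i] of the 8-lane concatenation a:b, where idx
 * is a compile-time immediate.  Masking to 3 bits is an assumption. */
void v32x4_shuffle2_imm(uint32_t out[4], const uint32_t a[4],
                        const uint32_t b[4], const uint8_t idx[4]) {
    for (int i = 0; i < 4; i++) {
        uint8_t j = idx[i] & 7;            /* 0..3 select from a, 4..7 from b */
        out[i] = (j < 4) ? a[j] : b[j - 4];
    }
}
```

For example, the immediate rule 0 4 1 5 interleaves the low lanes of the two inputs.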

EDITED: Old PR description below.

Hello everyone,

I created a small specification for shuffle and permute operations.
My intent with this merge request is mainly to have a discussion.

A few general questions:

  • How easy/difficult is it to implement?
  • How efficient can this be?
  • Are some operations/use-cases missing?
  • What are the caveats of such an approach?
  • ...

More specialized questions:

  • Is permute required to avoid input duplication?
  • Can the shuffling operations with immediates be avoided by relying on detecting constant index vectors?

Any comment on this is welcome.

@jfbastien
Member

In general the way we've approached SIMD instructions is to see what kind of code would benefit from these opcodes, and how. Can you document in this PR:

  • What opcodes you want to capture for x86 and ARM (and each one's various architecture revisions)
  • What kind of applications benefit
  • What performance gains they see (assuming a compiler couldn't just have figured out the right instructions already)
  • What .wasm binary size difference we get

@lemaitre
Author

lemaitre commented Apr 19, 2018

I first want to mention that the position of WASM is a bit different from a regular ISA:
it needs (a bit) more abstraction in order to allow efficient implementations on architectures with different instructions.
In that respect, I think WASM should provide high-level-ish operations, and generic shuffles fit completely into this scheme.

  • What opcodes you want to capture for x86 and ARM (and each one's various architecture revisions)

Basically every shuffling operation supported by an architecture:

SSE2:

  • movhlps: covered by v32x4.shuffle 6 7 2 3 or v64x2.shuffle 3 1
  • movlhps: covered by v32x4.shuffle 0 1 4 5 or v64x2.shuffle 0 2
  • movsd: covered by v64x2.shuffle 2 1
  • movss: covered by v32x4.shuffle 4 1 2 3
  • pshufd: covered by v32x4.permute A B C D with A B C D in [0, 1, 2, 3]
  • shufpd: covered by v64x2.shuffle A B with A in [0, 1] and B in [2, 3]
  • shufps: covered by v32x4.shuffle A B C D with A B in [0, 1, 2, 3] and C D in [4, 5, 6, 7]
  • pshufhw: covered by v16x8.permute 0 1 2 3 A B C D with A B C D in [4, 5, 6, 7]
  • pshuflw: covered by v16x8.permute A B C D 4 5 6 7 with A B C D in [0, 1, 2, 3]
  • punpckhbw: covered by v8x16.shuffle 8 24 9 25 10 26 11 27 12 28 13 29 14 30 15 31
  • punpcklbw: covered by v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
  • punpckhwd: covered by v16x8.shuffle 4 12 5 13 6 14 7 15
  • punpcklwd: covered by v16x8.shuffle 0 8 1 9 2 10 3 11
  • punpckhdq/unpckhps: covered by v32x4.shuffle 2 6 3 7
  • punpckldq/unpcklps: covered by v32x4.shuffle 0 4 1 5
  • punpckhqdq/unpckhpd: covered by v64x2.shuffle 1 3
  • punpcklqdq/unpcklpd: covered by v64x2.shuffle 0 2
  • _MM_TRANSPOSE4_PS: covered by (v64x2.shuffle 3 1) ×2 + (v64x2.shuffle 0 2) ×2 + (v32x4.shuffle 2 6 3 7) ×2 + (v32x4.shuffle 0 4 1 5) ×2 (this macro expands to 8 SSE instructions)

SSE3:

  • movddup: covered by v64x2.shuffle 0 0
  • movhdup: covered by v32x4.permute 1 1 3 3
  • movldup: covered by v32x4.permute 0 0 2 2

SSSE3:

  • pshufb: covered by v8x16.shuffleVar

AVX:

  • vpermilps: covered by v32x4.shuffleVar
  • vpermilpd: covered by v64x2.shuffleVar

AVX2:

  • vpbroadcastb: covered by v8x16.permute 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  • vpbroadcastw: covered by v16x8.permute 0 0 0 0 0 0 0 0
  • vpbroadcastd/vbroadcastss: covered by v32x4.permute 0 0 0 0
  • vpbroadcastq/vbroadcastsd: covered by v64x2.permute 0 0

AVX512:

  • vpermi2b: covered by v8x16.shuffleVar

Neon:

  • vswp: covered partially by v64x2.permute 1 0 (other use cases are just renaming)
  • vrev16.8: covered by v8x16.permute 1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14
  • vrev32.8: covered by v8x16.permute 3 2 1 0 7 6 5 4 11 10 9 8 15 14 13 12
  • vrev32.16: covered by v16x8.permute 1 0 3 2 5 4 7 6
  • vrev64.8: covered by v8x16.permute 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
  • vrev64.16: covered by v16x8.permute 3 2 1 0 7 6 5 4
  • vrev64.32: covered by v32x4.permute 1 0 3 2
  • vext: covered by v8x16.shuffle (or v16x8.shuffle, v32x4.shuffle, or v64x2.shuffle for some shifts)
  • vtrn.8: covered by v8x16.shuffle ×2
  • vtrn.16: covered by v16x8.shuffle ×2
  • vtrn.32: covered by v32x4.shuffle ×2
  • vzip.8: covered by v8x16.shuffle 8 24 9 25 10 26 11 27 12 28 13 29 14 30 15 31 + v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
  • vzip.16: covered by v16x8.shuffle 4 12 5 13 6 14 7 15 + v16x8.shuffle 0 8 1 9 2 10 3 11
  • vzip.32: covered by v32x4.shuffle 2 6 3 7 + v32x4.shuffle 0 4 1 5
  • vuzp.8: covered by v8x16.shuffle 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 + v8x16.shuffle 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
  • vuzp.16: covered by v16x8.shuffle 0 2 4 6 8 10 12 14 + v16x8.shuffle 1 3 5 7 9 11 13 15
  • vuzp.32: covered by v32x4.shuffle 0 2 4 6 + v32x4.shuffle 1 3 5 7
  • vtbl: covered by v8x16.permuteVar for a 1-vector table and by v8x16.shuffleVar for a 2-vector table (bigger tables need emulation with this design: multiple shuffleVars + selects)
  • vtbx: not supported (same as vtbl + select)
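The larger-table emulation mentioned for vtbl (multiple shuffleVars + selects) can be sketched in scalar form; a vector version would replace each loop body by one byte shuffle and one select/blend. Function and parameter names here are hypothetical:

```c
#include <stdint.h>

/* Scalar sketch of a 32-entry table lookup built from two 16-entry lookups:
 * look up each 16-byte half of the table, then select on bit 4 of the index.
 * A SIMD version is two shuffleVar ops plus a select. */
void tbl32_lookup(uint8_t out[16], const uint8_t table[32],
                  const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t lo = table[idx[i] & 15];        /* lookup in bytes 0..15  */
        uint8_t hi = table[16 + (idx[i] & 15)]; /* lookup in bytes 16..31 */
        out[i] = (idx[i] & 16) ? hi : lo;       /* select on index bit 4  */
    }
}
```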

If somebody wants it for Altivec/VSX, I can do that too.
I stopped here, as I think I have made my point for this question.

  • What kind of applications benefit

I will answer with low-level algorithms.
Shuffling can be used for the following (non-exhaustive list):

  • in-register transposition: for matrix transposition, or SoA <-> AoS conversion
  • reductions (as shown in the last paragraph)
  • moving window: aka shifts across vectors (for FIR filters)
  • interleaving data: for complex arithmetic
  • table lookup (to a certain extent)
  • endianness conversion
  • ...
  • What performance gains they see (assuming a compiler couldn't just have figured out the right instructions already)

These instructions cannot be emulated efficiently without some form of shuffling.
The only way to do it would be to fall back to scalar emulation.

That being said, if you compare with the shuffle instruction that was there before this MR (v8x16.shuffle), most of my instructions are redundant.
Only the shuffling instructions with a runtime shuffling index vector will be new.
Such an instruction will mainly be useful for table lookups in small tables.
In that case, without this MR, it also needs scalar emulation.

Now, I expect the redundancy to help the WASM->ASM translation with pattern recognition.
Better-recognized patterns would mean better performance. But this is only speculation at this stage.
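To illustrate the kind of pattern recognition meant here, a WASM->ASM compiler could compare an immediate v32x4 shuffle rule against a few known x86 patterns. This is a sketch; the helper and the string return values are mine, not part of any real compiler:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of shuffle-rule pattern matching in a WASM->ASM compiler: map a
 * v32x4 immediate index rule to an x86 instruction when one fits. */
static bool rule_is(const uint8_t x[4], uint8_t a, uint8_t b,
                    uint8_t c, uint8_t d) {
    return x[0] == a && x[1] == b && x[2] == c && x[3] == d;
}

const char *match_v32x4_shuffle(const uint8_t idx[4]) {
    if (rule_is(idx, 0, 4, 1, 5)) return "punpckldq"; /* interleave low  */
    if (rule_is(idx, 2, 6, 3, 7)) return "punpckhdq"; /* interleave high */
    if (idx[0] < 4 && idx[1] < 4 && idx[2] >= 4 && idx[3] >= 4)
        return "shufps";              /* low lanes from a, high from b */
    return "generic";                 /* fall back, e.g. a pshufb pair */
}
```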

  • What .wasm binary size difference we get

Compared to the situation without any shuffling operation, it will be a huge gain because we don't need scalar emulation for these.

Now, the comparison with only v8x16.shuffle:

Instruction        nb inputs   immediate size   binary coding size
v8x16.permute      1           64 bits          10 B
v16x8.permute      1           24 bits           5 B
v32x4.permute      1            8 bits           3 B
v64x2.permute      1            2 bits           3 B
v8x16.shuffle      2           128 bits         18 B
v16x8.shuffle      2           48 bits           8 B
v32x4.shuffle      2           16 bits           4 B
v64x2.shuffle      2            4 bits           3 B
v8x16.permuteVar   2            0 bits           2 B
v16x8.permuteVar   2            0 bits           2 B
v32x4.permuteVar   2            0 bits           2 B
v64x2.permuteVar   2            0 bits           2 B
v8x16.shuffleVar   3            0 bits           2 B
v16x8.shuffleVar   3            0 bits           2 B
v32x4.shuffleVar   3            0 bits           2 B
v64x2.shuffleVar   3            0 bits           2 B

This table makes it pretty clear that v8x16.shuffle has a huge coding overhead that the other instructions mitigate.
Note that if you need a permute but only have a shuffle instruction, you first need to duplicate the input on the stack (2 more bytes?).
The variable variants require first computing the vector of indices, which would take more space in the binary than the constant version with an immediate.

However, I have no idea how big the difference will be on a complete program.

I hope this allows some discussion before a deeper analysis.

@gnzlbg
Contributor

gnzlbg commented Aug 7, 2018

Note that if you need a permute but only have a shuffle instruction, you first need to duplicate the input on the stack (2 more bytes?).

On x86 you can call pshufd xmm0, xmm0, imm with the same vector register twice, so I don't think one needs to duplicate the vector or pass it through the stack. I don't know WASM, but this seems like a pretty basic feature of any ISA, and it would surprise me if WASM didn't support it.

Can the shuffling operations with immediates be avoided by relying on detecting constant index vectors?

WASM to machine code compilers are not necessarily optimizing compilers. So I don't think this can be avoided because that would introduce a pretty big performance cliff when the immediate mode operand is not correctly identified as such (more on this below).

Only the shuffling instructions with a runtime shuffling index vector will be new.

AFAIK x86/x86_64+SSSE3, arm32/64+neon, ... only support shuffling bytes with run-time indices. The moment you want to shuffle v16x8, v32x4, v64x2, ... you are out-of-luck.

So why add these shuffle instructions without immediate-mode arguments for all vector types when no ISA really supports them?

Also, what is the best code that a machine code generator could generate from such shuffles for vectors in general? Well if the vector is a v8x16, and we have an immediate mode argument, or an optimizing machine code generator, then the best we can get is one "shuffle bytes" / "table lookup" instruction on x86 and arm.

Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register.

That's pretty bad from the point of view of code size and performance if the "user" (compiler targeting WASM, user, ...) provides run-time indices instead of an immediate mode operand for whatever reason (optimizations failed, bug, user didn't know...).


All in all, I think that:

  • Adding an instruction to shuffle bytes (v8x16) with run-time indices could be proposed, since this is supported by common ISAs (SSSE3 and neon at least, which means this is typically available at least on most browsers compiling WASM).

  • Adding a permute instruction is not necessary:

    • with immediate mode arguments a machine code generator without optimizations can lower shuffle x x ids to the appropriate permute instructions if the same vector is used twice, or even if different vectors are used but the lane indices only index into one of them.
    • I am unconvinced that any potential code size reduction will be significant enough to be observable in practice. In my experience, most programs don't use shuffles, and those who do, use it in a tiny fraction of the code, such that saving a couple of bytes here and there won't really have a real impact in real programs. There are obviously counter examples: e.g. a library of matrix multiplication algorithms might use shuffles a lot, but even there, I am skeptical about the code size saving that a new instruction for permutes could deliver. I guess I just value keeping the ISA simple more.
  • Adding shuffles and permutes with run-time indices lacks a bigger motivation. AFAICT no widely used architecture really supports them yet. LLVM might be adding support for this to be able to target RISC-V vector extensions, but this is all too much in flux for my taste.

@lemaitre
Author

lemaitre commented Aug 7, 2018

On x86 you can call pshufd xmm0, xmm0, imm with the same vector register twice, so I don't think one needs to duplicate the vector or pass it through the stack. I don't know WASM, but this seems like a pretty basic feature of any ISA, and it would surprise me if WASM didn't support it.

WASM is a stack machine: you cannot address individual registers.
An instruction pops as many operands from the stack as it needs.

So an instruction taking 2 arguments pops 2 operands from the stack.
If you want the 2 arguments to be the same, you first need to duplicate the top of the stack.
But this should probably not generate any duplication instruction on the actual machine.

Can the shuffling operations with immediates be avoided by relying on detecting constant index vectors?

WASM to machine code compilers are not necessarily optimizing compilers. So I don't think this can be avoided because that would introduce a pretty big performance cliff when the immediate mode operand is not correctly identified as such (more on this below).

That was also my feeling, but I think the question needed to be asked.

AFAIK x86/x86_64+SSSE3, arm32/64+neon, ... only support shuffling bytes with run-time indices. The moment you want to shuffle v16x8, v32x4, v64x2, ... you are out-of-luck.

So why add shuffles for all vector types with run-time indices when these are not supported?

That's not true. While I agree implementations lack some of them, support for them is getting better and better.
AVX supports shuffling 1x v32x4 and 1x v64x2. You can also emulate 1x v16x8 with a gather. And you can emulate 2x vNxM from 1x vNxM.
AVX512 supports shuffling 2x v32x4 and 2x v64x2.

Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16.

My point is: why penalize everybody by having a single one when some architectures support some of them and would benefit from them?

Also, what is the best code that a machine code generator could generate from such shuffles for vectors in general? Well if the vector is a v8x16, then just one instruction on x86 and arm.

Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register.

You don't have to go back to scalar to emulate all the others.
And you can actually be pretty smart on how you do the emulation if you have some versions at your disposal.

The idea would be: if a user needs 1x v8x16, fine, they can use it.
But if their code needs 2x v32x4, they should also be able to use it, and it will either map to an instruction doing just that, or get the best emulation possible for the current architecture.

But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 for WASM.
This would lead to worse performance on all platforms except the ones supporting only this instruction (not so common anymore).

That's pretty bad from the point of view of code size and performance if the "user" (compiler targeting WASM, user, ...) provides run-time indices instead of an immediate mode operand for whatever reason (optimizations failed, bug, user didn't know...).

Or the code they are writing cannot be written with immediates: table lookups are a good example.

All in all, I think that:

  • Adding an instruction to shuffle bytes (v8x16) with run-time indices could be proposed, since this is supported by common ISAs (x86 SSSE3 and newer, and arm neon at least).

  • Adding a permute instruction is not necessary

Well, the overhead of 2 bytes to duplicate the top of the stack to emulate a permute with a shuffle is probably fine.
So I also think it is fine to have only 2-input shuffling.
I think it was important to ask the question, though.

  • Adding shuffles and permutes with run-time indices lacks a bigger motivation. AFAICT no widely used architecture really supports them yet. LLVM might be adding support for this to be able to target RISC-V vector extensions, but this is all too much in flux for my taste.

Well, AVX512 supports almost all variants with run-time indices. Only the 16-bit elements are not supported.


With all that being said, I think it is crucial to have the extra element size variants:

  • It makes the ISA nice and clean
  • With immediate indices, it allows a nice code size reduction when shuffling is used heavily (more common than you might think). This also simplifies the pattern recognition of the shuffling rules for longer elements and might make the JIT compiler faster (no numbers on that...)
  • With runtime indices, it allows adapting the code generation to better fit the underlying architecture

@gnzlbg
Contributor

gnzlbg commented Aug 7, 2018

Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16.

[...]

You don't have to go back to scalar to emulate all the others.
And you can actually be pretty smart on how you do the emulation if you have some versions at your disposal.

Could you mention how this is done? More specifically, how can you do this without knowing the values of the indices at compile-time?

My point is: why penalize everybody by having a single one when some architectures support some of them and would benefit from them?

Currently, WASM has no vector ISA, only a proposed one. This vector ISA is conservative. It only supports 128-bit wide vectors, it doesn't provide horizontal reductions, etc. Why? Because WASM is shipped over the internet to unknown machines, where it has to be compiled and run blazingly fast and reliably so that a webpage can render instantaneously. The most common hardware where WASM runs is x86 desktops, and the billions of arm Android, iOS, ... devices.

Shuffling v8x16s with run-time indices is supported on SSSE3, arm32+neon, arm64+neon, powerpc+altivec... which is pretty much all hardware in which WASM currently runs, and I don't think that adding these would be a very controversial addition to the spec with the right motivation (in which domains and for what applications are they important, etc.).

On the other hand, when you asked the question we were talking explicitly about shuffling with run-time indices for all vectors but v8x16. You state that WASM should support these intrinsics because "some architectures support some of them", where "some" actually means that some of the intrinsics that you propose adding are not supported by any architecture while the rest is supported only by x86+AVX/AVX2/AVX-512.

That's a tiny fraction of the devices on which WASM runs; everywhere else these would need to be emulated, potentially failing to deliver reliable performance, complicating the code generation backend, and making WASM potentially slower to compile, etc.

You mention that some of these are supported in AVX-512 as if this were an argument to add them, but AVX-512 is a very controversial ISA often described as "horrible" (which might mean that the instructions it offers won't be offered by any other ISA), which is supported by basically zero hardware currently used to browse the internet, and even if we were lucky enough to have WASM running on an AVX-512 machine, there is currently no consensus in the technical community about whether actually using AVX-512 at run-time is worth the trouble.

When you mention that WASM should have this or that instruction because it is available in AVX-512, I actually think that WASM shouldn't have it because it isn't available on AVX, SSE4.2, NEON, ALTIVEC, etc. In a nutshell, if an instruction is only available on AVX-512, I see that as a pretty strong argument against adding it to the ISA.


But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 for WASM.

I don't see the issue with this. People don't often write WASM by hand: they write C, Rust, or some other higher-level language, and they then run an optimizing compiler like LLVM that generates WASM, and which already has a framework for lowering vector shuffles to hardware.

I'd rather have these optimizing compilers do these optimizations than force the WASM compiler to become an optimizing compiler.

Well if the vector is a v8x16, and we have an immediate mode argument, or an optimizing machine code generator, then the best we can get is one "shuffle bytes" / "table lookup" instruction on x86 and arm.

Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register.
That's pretty bad from the point of view of code size and performance if the "user" (compiler targeting WASM, user, ...) provides run-time indices instead of an immediate mode operand for whatever reason (optimizations failed, bug, user didn't know...).

Or the code they are writing cannot be written with immediates: table lookups are a good example.

A table lookup is just a shuffle with run-time indices. Note also, that I was talking about shuffling non-v8x16 vectors with run-time indices. The "user didn't know" refers to the precise case that you mention: a user wants to explicitly do something that cannot be written with immediates, like a table lookup, yet because they are shuffling non-v8x16 vectors they get potentially horrible cross-platform performance.

Adding an instruction to WASM doesn't fix this, just shifts the problem of generating efficient machine code to the code generator, which might not be an optimizing compiler. From the user POV, inspecting the assembly generated won't help them, because they will just see a single WASM instruction, which looks fast. I'd just rather have the optimizing compiler deal with this. Adding these to WASM appears to me to be unnecessary trouble, for little win.


With all that being said, I think it is crucial to have the extra element size variants:

FWIW, I have nothing against the extra element size variants, nor about adding shuffles with run-time indices for v8x16. I don't think these are controversial, and wish they would be proposed in a single, non-controversial PR, separated from everything else.

I have doubts about whether the value that the permutes add is worth it. I agree that they do add value (reduced code size), but this comes at the cost of increasing the ISA. Performance-wise, these should generate the same machine code as the shuffles on all targets. Honestly, I'd just remove them. Once the uncontroversial parts are merged and implemented, it will be easier to assess the code size concerns. If these turn out to be significant, adding them might not be controversial, and you will have proof that adding them solves real code size problems, rather than hypothetical ones.

I have serious doubts about shuffling / permutes with run-time indices for all vector types that are not v8x16 because cross-platform support for these is too limited. If you really want them it might make sense to split them in a different PR, so that the uncontroversial parts can make progress independently of what happens with these. They can always be added in a backwards compatible way in the version 1.x of the ISA, so this shouldn't really be a big deal.

@lemaitre
Author

lemaitre commented Aug 7, 2018

Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16.

Could you mention how this is done? More specifically, how can you do this without knowing the values of the indices at compile-time?

Here is one example to implement 1x v32x4 in SSSE3:

#include <tmmintrin.h>  /* SSSE3 */

__m128i _mm_permutevar_epi32(__m128i a, __m128i b) {
    /* Byte offsets 0..3 within each 32-bit lane. */
    __m128i off = _mm_set_epi8(3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0);
    /* Rule to broadcast the low byte of each 32-bit index to its whole lane. */
    __m128i shuffle = _mm_set_epi8(12, 12, 12, 12, 8, 8, 8, 8, 4, 4, 4, 4, 0, 0, 0, 0);
    __m128i v = _mm_shuffle_epi8(b, shuffle);
    /* Scale the lane indices to byte indices (*4) and add the byte offsets. */
    v = _mm_add_epi8(_mm_slli_epi32(v, 2), off);
    /* One byte shuffle performs the actual 32-bit permutation. */
    return _mm_shuffle_epi8(a, v);
}

This boils down to only 4 instructions.

On the other hand, when you asked the question we were talking explicitly about shuffling with run-time indices for all vectors but v8x16. You state that WASM should support these intrinsics because "some architectures support some of them", where "some" actually means that some of the intrinsics that you propose adding are not supported by any architecture while the rest is supported only by x86+AVX/AVX2/AVX-512.

That's a tiny fraction of the devices on which WASM runs; everywhere else these would need to be emulated, potentially failing to deliver reliable performance, complicating the code generation backend, and making WASM potentially slower to compile, etc.

No: AVX machines represent a big proportion of desktop machines and support 1x v32x4 and 1x v64x2.
AVX has been there since 2011 on x86 machines (Sandy Bridge for Intel and Bulldozer for AMD).
Those are not a minority. In fact, I'm quite confident they are now the majority of all x86 user machines (not taking company servers into account).

Moreover, AVX512 will get into consumer machines (next year?), so it is a target of interest.

Now, my point is: a big part of the machines will be able to take advantage of the extra information carried by the extra sizes.
Let's not cripple them just because some architectures don't support those natively.

You mention that some of these are supported in AVX-512 as if this were an argument to add them, but AVX-512 is a very controversial ISA often described as "horrible" (which might mean that the instructions it offers won't be offered by any other ISA), which is supported by basically zero hardware currently used to browse the internet, and even if we were lucky enough to have WASM running on an AVX-512 machine, there is currently no consensus in the technical community about whether actually using AVX-512 at run-time is worth the trouble.

While AVX512 is far from perfect, it is the best SIMD ISA from Intel. The only real controversy about AVX512 is: is 512-bit-wide SIMD worth it?

But that does not concern us here, as we are only dealing with short SIMD (128 bits).
And AVX512 still has many instructions for 128-bit-wide SIMD that simplify some things.

When you mention that WASM should have this or that instruction because it is available in AVX-512, I actually think that WASM shouldn't have it because it isn't available on AVX, SSE4.2, NEON, ALTIVEC, etc. In a nutshell, if an instruction is only available on AVX-512, I see that as a pretty strong argument against adding it to the ISA.

It is not some random instruction whose semantics are ultra specific or unclear.
It just takes something you agreed was useful and extends it to be complete.

I would add: this is the kind of opinion that led to SSE/AVX, where some instructions are missing for some types but do exist for other types. SSE and AVX (and AVX512 for that matter) have many inconsistencies and holes in their ISA features.
Those inconsistencies make these ISAs painful to use.

That's what I want to avoid here.

And as I said, providing more generic instructions will allow a better translation on some common architectures because they carry more information.

But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 for WASM.

I don't see the issue with this. People don't often write WASM by hand: they write C, Rust, or some other higher-level language, and they then run an optimizing compiler like LLVM that generates WASM, and which already has a framework for lowering vector shuffles to hardware.

I'd rather have these optimizing compilers do these optimizations than force the WASM compiler to become an optimizing compiler.

But such a compiler cannot know which architecture it will run on, so it is better to carry more information that can be used by the WASM compiler.
This last step does not rely on heavy compilation: just translate the shuffle instruction into a predefined sequence doing the same action on the current platform.
It does not need to perform more optimizations with this design.

Adding an instruction to WASM doesn't fix this, just shifts the problem of generating efficient machine code to the code generator, which might not be an optimizing compiler. From the user POV, inspecting the assembly generated won't help them, because they will just see a single WASM instruction, which looks fast. I'd just rather have the optimizing compiler deal with this. Adding these to WASM appears to me to be unnecessary trouble, for little win.

The WASM->ASM generation will not be heavier for these instructions than for others, as the optimal way to emulate them will be precomputed.

With all that being said, I think it is crucial to have the extra element size variants:

FWIW, I have nothing against the extra element size variants, nor about adding shuffles with run-time indices for v8x16. I don't think these are controversial, and wish they would be proposed in a single, non-controversial PR, separated from everything else.

This could be done, indeed.
I'll do it tomorrow.

I have doubts about whether the value that the permutes add is worth it. I agree that they do add value (reduced code size), but this comes at the cost of increasing the ISA. Performance-wise, these should generate the same machine code as the shuffles on all targets. Honestly, I'd just remove them. Once the uncontroversial parts are merged and implemented, it will be easier to assess the code size concerns. If these turn out to be significant, adding them might not be controversial, and you will have proof that adding them solves real code size problems, rather than hypothetical ones.

I also have my doubts on this subject, hence my question.
You seem to think they're not worth it currently, and I'm completely fine with that.

@lemaitre
Author

lemaitre commented Aug 8, 2018

Ok, I removed the single-input permute instructions, and the runtime-index shuffles except v8x16.shuffleVar.

Now I'm wondering. Many names have been used to designate these operations:
shuffle, permute, swizzle, blend, table lookup.
Is shuffle really the best name? Is shuffleVar a good name for runtime indices?

I have no strong opinion on the subject.

@gnzlbg
Contributor

gnzlbg commented Aug 8, 2018

Is shuffleVar a good name for runtime indices?

Good question. I like it. This is why.

shuffle, permute, swizzle, blend, table lookup.

I am not a native English speaker, but all of these sound like pretty much synonyms to me.

Ideally, we could just use shuffle and add a second encoding of the instruction that accepts a non-immediate-mode operand. That would make the ISA cleaner. But I don't think this is a good idea when the cost of an instruction can vary a lot between immediate and non-immediate operands. Just give it a different name and call it a day. (E.g. in the TBM instruction set, AMD used the i suffix in bextri to denote the bit-extract variant that accepts immediate-mode arguments.)

So we could call the dynamic shuffle shuffle, and the shuffle with immediates shufflei. Whether we use shuffle and shuffleVar instead, I don't care much either way. I just want these operations, with minimal performance footguns.

Calling the new instruction shuffleVar initially sounds more than fair. I might not realize at first what Var stands for, but that's something I would discover when googling for the instruction.

@gnzlbg
Contributor

gnzlbg commented Dec 13, 2018

FWIW, Rust's packed_simd crate ended up implementing only the 1-argument variant of these.

The name that got consensus there after some bikeshedding was shuffle_dyn1 (_dyn for dynamic; variable had confusing connotations in programming), reserving the possibility of introducing a two-argument shuffle_dyn in the future. That is, the set that Rust ended up implementing for 128-bit wide vectors is:

v8x16.shuffle_dyn1(a: v128, indices: v128) -> v128
v16x8.shuffle_dyn1(a: v128, indices: v128) -> v128
v32x4.shuffle_dyn1(a: v128, indices: v128) -> v128
v64x2.shuffle_dyn1(a: v128, indices: v128) -> v128

Rust provides efficient implementations of these for arm32+v7+neon, arm64+asimd, x86/x86_64 + SSSE3, x86/x86_64 + AVX. Efficient implementations for powerpc should also be possible. I don't know about MIPS and RISCV - we haven't really started those yet.
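For reference, the semantics of such a one-argument dynamic shuffle can be sketched as a scalar loop. The names and the out-of-range behavior here are assumptions for illustration (zeroing lanes whose index has the high bit set, borrowed from x86's pshufb); the actual proposal may specify this differently.

```c
#include <stdint.h>

/* Scalar reference sketch of a one-input dynamic byte shuffle:
   each output lane selects a lane of `a` by the low 4 bits of the
   corresponding index; indices with the high bit set yield 0
   (pshufb-style zeroing - an assumption, not the proposal's text). */
static void v8x16_shuffle_dyn1(uint8_t out[16], const uint8_t a[16],
                               const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : a[idx[i] & 0x0F];
}
```

The wider-lane variants (v16x8, v32x4, v64x2) follow the same pattern with fewer, wider lanes.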

@lemaitre
Copy link
Author

I think it is much easier to write a one-input shuffle using a two-input shuffle than the opposite.
So I think it is better to specify two-input shuffles first.

And for the WASM -> ASM lowering, the pattern of implementing a one-input shuffle with a two-input shuffle is also easy to detect, so the appropriate instruction can be generated.
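The point above can be seen in a scalar sketch (names are illustrative, not part of the proposal): a one-input shuffle is just a two-input shuffle with the same vector passed twice, which is exactly the pattern a code generator would look for.

```c
#include <stdint.h>

/* Illustrative two-input shuffle with immediate indices:
   indices 0..3 select from `a`, 4..7 from `b`. */
static void v32x4_shuffle2(uint32_t out[4], const uint32_t a[4],
                           const uint32_t b[4], const int idx[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (idx[i] < 4) ? a[idx[i]] : b[idx[i] - 4];
}

/* One-input shuffle expressed via the two-input one:
   duplicate the input so every index stays in range. */
static void v32x4_shuffle1(uint32_t out[4], const uint32_t a[4],
                           const int idx[4]) {
    v32x4_shuffle2(out, a, a, idx);
}
```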

@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 13, 2018

I think it is much easier to write one input shuffle using two input shuffle than the opposite.

That is possibly true. Do you have, or can you point us to, an implementation of the two-input dynamic shuffle for the relevant vector types on the most common modern platforms (arm32, arm64, x86_64 SSE, and x86_64 AVX would suffice for me, but ppc64le would also be nice)?

I'd like to check how hard those are to implement, and also the performance of implementing a one-input shuffle on top of the two-input ones, instead of having tailored algorithms for one-input shuffles directly.

My experience with the arm and x86 implementations gives me the feeling that the machine code generated for the two-input versions would need to be significantly different from the single-input one.

@lemaitre
Copy link
Author

Ah sorry, I missed that you suggested one-input dynamic shuffles.
Then, it's quite the opposite... (but it's still doable)

@gnzlbg
Copy link
Contributor

gnzlbg commented Dec 13, 2018

Ah yes, I was talking about one input dynamic shuffles! I guess we are on the same page!

I agree that implementing the two-input version is definitely possible; we haven't done it yet because there has been little demand, and its mapping to hardware and its performance are not as straightforward to reason about as for the single-input version (multi-instruction sequences vs. often just a single instruction).

Do you think it could make sense to also offer a single input dynamic shuffle instruction ? If so, maybe it might be worth it to start by adding the single instruction dynamic shuffle version, while keeping the door open for a two input dynamic shuffle version in the future. The single instruction version maps in a straightforward way to most hardware, which makes it uncontroversial, and delivers instant value.
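For illustration, here is a scalar sketch of the multi-instruction emulation alluded to above: a two-input dynamic byte shuffle built from two one-input shuffles and an OR-select. The pshufb-style zeroing of indices with the high bit set is an assumption; the helper names are illustrative.

```c
#include <stdint.h>

/* One-input dynamic shuffle with pshufb-style zeroing (assumption):
   indices with the high bit set produce 0. */
static void shuffle1_dyn(uint8_t out[16], const uint8_t a[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : a[idx[i] & 0x0F];
}

/* Two-input dynamic shuffle emulated with two one-input shuffles:
   indices 0..15 select from `a`, 16..31 from `b`. Lanes meant for the
   other vector are forced to zero, then the halves are OR'd together. */
static void shuffle2_dyn(uint8_t out[16], const uint8_t a[16],
                         const uint8_t b[16], const uint8_t idx[16]) {
    uint8_t lo[16], hi[16], ia[16], ib[16];
    for (int i = 0; i < 16; i++) {
        ia[i] = (idx[i] < 16) ? idx[i] : 0x80;                   /* kill b-lanes */
        ib[i] = (idx[i] >= 16 && idx[i] < 32) ? (uint8_t)(idx[i] - 16) : 0x80;
    }
    shuffle1_dyn(lo, a, ia);
    shuffle1_dyn(hi, b, ib);
    for (int i = 0; i < 16; i++)
        out[i] = lo[i] | hi[i];                                  /* select by OR */
}
```

On x86 this is roughly two pshufb plus a por; a native two-input instruction (e.g. NEON vtbl2) would do it in one.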

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 2, 2019

@lemaitre could you resolve the conflicts ?

@dtig would it be possible to review this? as proposed this is (1) super useful, and (2) easily implementable on most architectures.

@lemaitre
Copy link
Author

lemaitre commented Mar 2, 2019

@lemaitre could you resolve the conflicts ?

I've just done the merge from master.
Also, I changed the dynamic indices version from 2 inputs to 1 input as suggested previously.

Tell me if everything is correct now.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 2, 2019

So this LGTM. Optionally it could also modify the binary encoding to include these, but maybe it is better to wait on @dtig review.

@tlively
Copy link
Member

tlively commented Mar 6, 2019

I think the discussion here is missing justification for why these instructions should be added. We need to know specific classes of applications that would benefit and we need to have tentative performance and/or code size numbers that would show that including these instructions would be worth the extra implementation effort.

In particular, the variable-index permute seems very complex and non-portable based on discussions above. WebAssembly instructions are meant to map very simply to the underlying native instructions so that baseline compilers in engines do not need to do much work, so we actively do not want abstractions in the instruction set. What specific use cases do variable-index permutes address? Are there more portable formulations of it that would still be useful?

@zeux
Copy link
Contributor

zeux commented Mar 6, 2019

@tlively As one example, see comment #24 (comment). The algorithm implemented there needs v8x16.permute_dyn and isn't practical without it.

@dtig
Copy link
Member

dtig commented Mar 6, 2019

Would it be possible to get numbers that are specific to v8x16.permute_dyn? I guess what I'm looking for here is concrete data to justify the complexity - this could even be just a native C++ comparison, i.e. the difference in performance on native between pshufb and an emulated dynamic shuffle. The coarse-grained with/without-SIMD data, though helpful, does not justify the addition of this particular set of permutations.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 6, 2019

There are three different things being proposed in this issue.

Encoding of shuffle immediates

@lemaitre has already hinted that it makes sense to them to split that discussion, and @dtig has also commented that they would find that useful, so let's assume that this will happen.

More shuffle instructions with immediate indices

These are useful, e.g., for adding all f32s in a f32x4 vector, and widely supported on all architectures (x64, arm, ppc, etc.), but they are not strictly necessary because one can just use i8x16.shuffle instead.

The question is IMO whether they are worth it. The two pros I can think of here are that they reduce binary size, and that they might simplify the WASM machine-code generator: instead of having to recognize that a particular i8x16.shuffle index sequence actually expresses an i32x4 shuffle, it can just lower it straightforwardly to optimal code.

I have no idea whether this is the case, but maybe @sunfishcode can chime in and comment about how easy it is for Cranelift to generate optimal code for i32x4 shuffles from i8x16 indices or similar? EDIT: and maybe also comment about what they think about these other shuffle variants.
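As a concrete instance of the f32x4 reduction mentioned above, here is a scalar sketch of the two shuffle+add steps that a typed shuffle with immediate indices expresses directly (the stand-in type and helpers are illustrative, not proposal names):

```c
/* Scalar stand-in for a 4-lane f32 vector. */
typedef struct { float v[4]; } f32x4;

static f32x4 f32x4_shuffle(f32x4 a, const int idx[4]) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[idx[i]];
    return r;
}

static f32x4 f32x4_add(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* Horizontal sum via two shuffle+add steps; the index arrays are the
   immediates a v32x4-level shuffle would carry. */
static float f32x4_hsum(f32x4 x) {
    static const int swap_pairs[4] = {2, 3, 0, 1};
    static const int swap_lanes[4] = {1, 0, 3, 2};
    f32x4 t = f32x4_add(x, f32x4_shuffle(x, swap_pairs)); /* {a0+a2, a1+a3, ...} */
    t = f32x4_add(t, f32x4_shuffle(t, swap_lanes));       /* every lane = total */
    return t.v[0];
}
```

Expressed with i8x16 indices, the code generator would first have to prove that each group of four byte indices forms a contiguous aligned word before emitting the single-instruction f32x4 form.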

Dynamic shuffles

The i8x16.permute_dyn is supported on all mainstream architectures (x64 SSE, NEON, PPC), and it cannot be emulated easily with static shuffles.

The question here seems to be whether this is useful / whether it allows to write faster programs.

The classical Fannkuch Redux benchmark (benchmarks game and paper) shows that the two fastest implementations in C and C++ (1.5x faster than the 3rd place) actually use this intrinsic explicitly; it is called _mm_shuffle_epi8 and lowers to pshufb. On ARM it would just lower to vtbl (which stands for table lookup, another useful thing this can be applied to).

In Rust's portable packed SIMD we call this shuffle1_dyn (shuffle, 1-argument, dynamic indices), and the Fannkuch benchmark became 1.3-1.5x faster with it on x86 (https://github.com/rust-lang-nursery/packed_simd/blob/master/examples/fannkuch_redux/src/simd.rs#L59). @hsivonen might be able to run the benchmark on arm64 and report the results for the scalar and vectorized implementations, e.g., on Android. Note, however, that arm64 has even better support for look-up tables than arm32, and packed_simd exploits this.

EDIT: so IMO i8x16.permute_dyn is well supported across the board, useful, and worth adding. (This PR originally proposed adding more permute_dyn variants, but those were not very well supported so they were dropped).

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 7, 2019

Maybe it might be worth it to split this into those three issues, so that we can discuss and resolve them independently of each other, and so that reaching consensus on one of the issues is not blocked by the other two.

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 7, 2019

Also, I just recalled another application of i8x16.permute_dyn: SIMD UTF-8 validation (e.g. see https://github.com/lemire/fastvalidate-utf-8/blob/master/include/simdutf8check.h#L35 - there is a blog post here https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/).

Basically, this instruction is useful whenever you need to perform a table lookup with run-time indices. This comes up often when encoding/decoding anything even mildly complex (UTF-8, DNA, etc.).
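A classic instance of this table-lookup idiom (an illustrative sketch, not taken from the linked UTF-8 validator) is per-byte popcount via two nibble lookups into a 16-entry table, where the dynamic byte shuffle plays the role of 16 parallel table fetches:

```c
#include <stdint.h>

/* Scalar stand-in for a one-input dynamic byte shuffle used as a
   16-entry parallel table lookup (indices are already in 0..15 here). */
static void shuffle1_dyn(uint8_t out[16], const uint8_t table[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = table[idx[i] & 0x0F];
}

/* Per-byte popcount: look up the bit count of each nibble, then add. */
static void popcount_bytes(uint8_t out[16], const uint8_t in[16]) {
    static const uint8_t nibble_pop[16] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    uint8_t lo[16], hi[16], pl[16], ph[16];
    for (int i = 0; i < 16; i++) {
        lo[i] = in[i] & 0x0F;
        hi[i] = in[i] >> 4;
    }
    shuffle1_dyn(pl, nibble_pop, lo);  /* popcount of low nibbles */
    shuffle1_dyn(ph, nibble_pop, hi);  /* popcount of high nibbles */
    for (int i = 0; i < 16; i++)
        out[i] = pl[i] + ph[i];
}
```

With pshufb or vtbl, each shuffle1_dyn call here is a single instruction operating on all 16 bytes at once.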

@zeux
Copy link
Contributor

zeux commented Mar 7, 2019

permute_dyn is critical for any byte-wise processing where the structure isn't fixed-width - this is how my decoder uses it as well. I'll try to measure the performance of the C++ code with emulation using 16 scalar table fetches.

@penzn
Copy link
Contributor

penzn commented Mar 7, 2019

@gnzlbg, thank you for summarizing (and thank you @lemaitre for bringing this up). Those are significant; we should try to reach some conclusion on all three. Does either one of you want to post this as three separate proposals (issues)? I can do it myself tomorrow, but I think you can probably explain it better than I do.

@zeux
Copy link
Contributor

zeux commented Mar 7, 2019

Added more elaborate benchmark numbers targeting specifically the lack of dynamic permute: #24 (comment)

@penzn
Copy link
Contributor

penzn commented Mar 7, 2019

Filed #68, #69, and finally #70. To me this feels like something we can move post-MVP when we have more examples of WASM SIMD running in the wild.

@lemaitre
Copy link
Author

As the community seems to have reached a consensus on the other PRs/issues, I have updated my PR to take all the changes into account.

@lemaitre lemaitre changed the title Shuffle and permute specification Shuffle with immediate indices specification Mar 31, 2019
@baryluk
Copy link

baryluk commented Oct 31, 2019

This PR would also help with targeting ARM SVE: shuffles could map onto one of the permute-with-immediate instructions (DUP, EXT, INSR, REV, REVB, REVH, REVW, SUNPKHI, SUNPKLO, TRN1, TRN2, UUNPKHI, UUNPKLO, UZP1, UZP2, ZIP1, ZIP2) or the copy/broadcast instructions (CPY, DUP, FCPY, FDUP, SEL), with the rest emulated using a permute with an extra vector or predicate register (COMPACT, SPLICE, or TBL).

"ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A"
https://developer.arm.com//docs/ddi0584/ad/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
https://developer.arm.com/-/media/developer/products/architecture/DDI0584A_d_SVE.zip

https://static.docs.arm.com/ddi0584/ae/DDI0584A_e_SVE_supp_armv8A.pdf

https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_00_en.pdf

https://static.docs.arm.com/101726/0100/porting_and_optimizing_hpc_applications_for_arm_sve_101726_0100_00_en.pdf

See these references for details.

@tlively
Copy link
Member

tlively commented Feb 2, 2021

Closing since we have consensus on our current strategy for shuffles.

@tlively tlively closed this Feb 2, 2021