Shuffle with immediate indices specification #30
Conversation
In general the way we've approached SIMD instructions is to see what kind of code would benefit from these opcodes, and how. Can you document in this PR:
- how reductions are computed with permutes
|
I first want to mention that the position of WASM is a bit different from a regular ISA:
Basically every shuffling operation supported by an architecture:
Neon:
If somebody wants Altivec/VSX, I can also do that.
I will answer with low level algorithms.
Those instructions cannot be emulated efficiently without any form of shuffling. That being said, the comparison with the shuffle instruction that was there before this MR is less clear-cut. Now, I expect the redundancy to help the WASM->ASM translation with pattern recognition.
Compared to the situation without any shuffling operation, it will be a huge gain because we don't need scalar emulation for these. Now, for the comparison with only the pre-existing shuffle: the per-instruction picture is pretty clear, but I have no idea how big the difference will be on a complete program. I hope this would allow some discussion before a deeper analysis. |
On x86 you can call
WASM to machine code compilers are not necessarily optimizing compilers. So I don't think this can be avoided because that would introduce a pretty big performance cliff when the immediate mode operand is not correctly identified as such (more on this below).
AFAIK x86/x86_64+SSSE3, arm32/64+neon, ... only support shuffling bytes with run-time indices. The moment you want to shuffle v16x8, v32x4, v64x2, ... you are out of luck. So why add these shuffle instructions without immediate-mode arguments for all vector types when no ISA really supports them?
Also, what is the best code that a machine code generator could generate from such shuffles for vectors in general? Well, if the vector is a v8x16, and we have an immediate-mode argument or an optimizing machine code generator, then the best we can get is one "shuffle bytes" / "table lookup" instruction on x86 and arm. Otherwise, it would need to copy the vector (or two vectors in the case of shuffles) back to memory, allocate another vector, perform a scalar shuffle into it, and copy the result back to a vector register (sketched below). That's pretty bad from the point of view of code size and performance if the "user" (a compiler targeting WASM, a person, ...) provides run-time indices instead of an immediate-mode operand for whatever reason (optimizations failed, a bug, the user didn't know...).
All in all, I think that:
|
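To make the cost of that fallback concrete, here is a minimal sketch (the `v128` struct and the function name are purely illustrative, not from the discussion) of the scalar path a non-optimizing code generator might have to emit for a two-input v8x16 shuffle with run-time indices:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } v128;  /* stand-in for a 128-bit vector value */

/* Every lane goes through memory and a scalar load/store: this is the
   "spill to memory, shuffle with scalar code, reload" path. */
static v128 shuffle2_dyn_scalar(v128 a, v128 b, v128 idx) {
    uint8_t src[32];
    v128 r;
    memcpy(src, a.b, 16);             /* spill both inputs */
    memcpy(src + 16, b.b, 16);
    for (int i = 0; i < 16; ++i)
        r.b[i] = src[idx.b[i] & 31];  /* one scalar gather per lane */
    return r;                         /* result reloaded into a vector register */
}
```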
WASM is a stack ISA: you cannot access individual registers. So an instruction taking 2 arguments needs to pop 2 registers from the stack.
That was also my feeling, but I think the question needed to be asked.
That's not true. While I agree implementations lack some of them, support for those is getting better and better. Also, while it cannot be done in one instruction, any variant can be implemented from 1x v8x16. My point is: why penalize everybody by providing only a single one, when some architectures support some of the others and would benefit from them?
You don't have to go back to scalar to emulate all the others. The idea would be: if a user needs 1x v8x16, fine, they can use it. But if you provide only 1x v8x16, the user (/compiler) will need to emulate 2x v32x4 on top of it for WASM (see the sketch below).
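As an illustration of that emulation (a sketch with a made-up helper name, not part of the proposal): the producer expands the 4 lane indices of a v32x4 shuffle into the 16 byte indices consumed by a v8x16 byte shuffle, relying on WASM's little-endian lane layout.

```c
#include <stdint.h>

/* Sketch: lower a v32x4 immediate shuffle onto the existing v8x16 byte shuffle
   by expanding its 4 lane indices (0..7 when selecting from two input vectors)
   into 16 byte indices. Helper name and signature are illustrative only. */
static void expand_v32x4_to_v8x16_indices(const uint8_t lane_idx[4],
                                           uint8_t byte_idx[16]) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int b = 0; b < 4; ++b) {
            /* Each selected 32-bit lane covers 4 consecutive bytes. */
            byte_idx[4 * lane + b] = (uint8_t)(4 * lane_idx[lane] + b);
        }
    }
}
```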
Or the code they are writing cannot be written with immediates: table lookups are a good example.
Well, the overhead of 2 bytes to duplicate the top register to emulate a permute with a shuffle is probably fine.
Well, AVX512 supports almost all variants of run-time indices; only 16-bit elements are not supported. With all that being said, I think it is crucial to have the extra element size variants:
|
Could you mention how this is done? More specifically, how can you do this without knowing the values of the indices at compile-time?
Currently, WASM has no vector ISA, only a proposed one. This vector ISA is conservative: it only supports 128-bit wide vectors, it doesn't provide horizontal reductions, etc. Why? Because WASM is shipped over the internet to unknown machines, where it has to be compiled fast and run reliably fast so that a webpage can render instantaneously. The most common hardware where WASM runs is x86 desktops and the billions of ARM Android, iOS, ... devices.
Shuffling v8x16s with run-time indices is supported on SSSE3, arm32+neon, arm64+neon, powerpc+altivec... which is pretty much all hardware on which WASM currently runs, and I don't think that adding these would be a very controversial addition to the spec with the right motivation (in which domains and for what applications are they important, etc.). On the other hand, when you asked the question we were talking explicitly about shuffling with run-time indices for all vector types other than v8x16, and hardware support for those is much rarer. That's a tiny fraction of the devices on which WASM runs; everywhere else they would need to be emulated, potentially failing to deliver reliable performance, and complicating the code generation backend, making WASM potentially slower to compile, etc.
You mention that some of these are supported in AVX-512 as if this were an argument to add them, but AVX-512 is a very controversial ISA often described as "horrible" (which might mean that the instructions it offers won't be offered by any other ISA), it is supported by basically zero hardware currently used to browse the internet, and even if we were lucky enough to have WASM running on an AVX-512 machine, there is currently no consensus in the technical community about whether actually using AVX-512 at run-time is worth the trouble. When you mention that WASM should have this or that instruction because it is available in AVX-512, I actually think that WASM shouldn't have it, because it isn't available on AVX, SSE4.2, NEON, ALTIVEC, etc. In a nutshell, if an instruction is only available on AVX-512, I see that as a pretty strong argument against adding it to the ISA.
I don't see what the issue is with this. People don't often write WASM by hand; they write C, Rust, or some other higher-level language, and they then run an optimizing compiler like LLVM that generates WASM and already has a framework for lowering vector shuffles to hardware. I'd rather have these optimizing compilers do these optimizations than force the WASM compiler to become an optimizing compiler.
A table lookup is just a shuffle with run-time indices. Note also that I was talking about shuffling non-v8x16 vectors. Adding an instruction to WASM doesn't fix this; it just shifts the problem of generating efficient machine code to the code generator, which might not be an optimizing compiler. From the user's POV, inspecting the generated code won't help them, because they will just see a single WASM instruction, which looks fast. I'd just rather have the optimizing compiler deal with this. Adding these to WASM appears to me to be unnecessary trouble for little win.
FWIW, I have nothing against the extra element size variants, nor against adding shuffles with run-time indices for v8x16.
I have doubts about whether the value that the permutes add is worth it. I agree that they do add value (reduced code size), but this comes at the cost of increasing the ISA. Performance-wise, these should generate the same machine code as the shuffles on all targets. Honestly, I'd just remove them. Once the uncontroversial parts are merged and implemented, it will be easier to assess the code size concerns. If these turn out to be significant, adding them might not be controversial, and you will have proof that adding them solves real code size problems rather than hypothetical ones.
I have serious doubts about shuffles / permutes with run-time indices for all vector types that are not v8x16. |
Here is one example to implement 1x v32x4 in SSSE3:

```c
__m128i _mm_permutevar_epi32(__m128i a, __m128i b) {
  // Byte offsets 0..3 within each 32-bit lane.
  __m128i off = _mm_set_epi8(3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0);
  // Broadcast the low byte of each 32-bit lane of b (the lane index) across its lane.
  __m128i shuffle = _mm_set_epi8(12, 12, 12, 12, 8, 8, 8, 8, 4, 4, 4, 4, 0, 0, 0, 0);
  __m128i v = _mm_shuffle_epi8(b, shuffle);
  // Scale the lane indices by 4 (bytes per lane) and add the per-byte offsets.
  v = _mm_add_epi8(_mm_slli_epi32(v, 2), off);
  // Gather the selected bytes: result lane i = a[b[i]].
  return _mm_shuffle_epi8(a, v);
}
```

This boils down to only 4 instructions.
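For illustration, here is a hypothetical use of the helper above (the wrapper name is made up, not from the comment): rotating the four 32-bit lanes by one position with run-time-style indices.

```c
#include <tmmintrin.h>

// Hypothetical usage: result lane i takes lane (i + 1) % 4 of v.
__m128i rotate_lanes_left(__m128i v) {
    __m128i idx = _mm_set_epi32(0, 3, 2, 1);  // per-lane source indices 1, 2, 3, 0
    return _mm_permutevar_epi32(v, idx);
}
```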
No: AVX machines represent a big proportion of the desktop machines and support 1x v32x4 and 1x v64x2. Moreover, AVX512 will get into consumer machines (next year?), so it is a target of interest. Now, my point is: a big part of the machines will be able to take advantage of the extra information carried by the extra sizes.
While AVX512 is far from perfect, it is the best SIMD ISA from Intel. The only real controversy about AVX512 is: is 512-bit-wide SIMD worth it? But this does not concern us here, as we are only dealing with short SIMD (128 bits).
It is not some random instruction whose semantics are ultra-specific or unclear. I would add: this is the kind of opinion that led to SSE/AVX, where some instructions are missing for some types but do exist for other types. SSE, AVX (and AVX512 for that matter) have many inconsistencies and holes in their ISA features. That's what I want to avoid here. And as I said, providing more generic instructions will allow a better translation on some common architectures because they carry more information.
But such a compiler cannot know on which architecture it will run, so it is better to carry more information that can be used by the WASM compiler.
The WASM->ASM generation will not be heavier for those instructions than for others, as the optimal way to emulate them will be precomputed.
This could be done, indeed.
I also have my doubts on this subject, hence my question. |
Ok, I removed the single-input permute instructions, and the runtime-indices shuffles except the v8x16 one. Now I'm wondering: many names have been used to designate these operations. I have no strong opinion on the subject. |
Good question. I like it. This is why.
I am not a native English speaker, but all of these sound like pretty much synonyms to me. Ideally, we could just pick a single base name and derive the names of the dynamic shuffle and of the new instructions from it. |
FWIW, Rust's portable packed SIMD module had the same naming discussion, and a name for these got consensus there after some bikeshedding.
Rust provides efficient implementations of these for arm32+v7+neon, arm64+asimd, x86/x86_64 + SSSE3, x86/x86_64 + AVX. Efficient implementations for powerpc should also be possible. I don't know about MIPS and RISCV - we haven't really started those yet. |
I think it is much easier to write a one-input shuffle using a two-input shuffle than the opposite. And for the WASM->ASM translation, the pattern of implementing a one-input shuffle with a two-input shuffle is also easier to detect in order to generate the appropriate instruction. |
That is possibly true. Do you have, or can you point us to, an implementation of the two-input dynamic shuffle for the relevant vector types on the most common modern platforms (arm32, arm64, x86_64 SSE, and x86_64 AVX would suffice for me, but ppc64le would also be nice)? I'd like to check how hard those are to implement, and also the performance of implementing a one-input shuffle on top of the two-input one, instead of having tailored algorithms for one-input shuffles directly. My experience with the arm and x86 implementations gives me the feeling that the machine code generated for the two-input versions would need to be significantly different from the one-input ones. |
Ah sorry, I missed that you suggested one input dynamic shuffles. |
Ah yes, I was talking about one-input dynamic shuffles! I guess we are on the same page! I agree that implementing the two-input version is definitely possible; we haven't done it yet because there has been little demand, and its mapping to hardware and its performance are not as straightforward to reason about as for the one-input version (multi-instruction sequences vs. often just a single instruction). Do you think it could make sense to also offer a one-input dynamic shuffle instruction? If so, maybe it might be worth it to start by adding the one-input dynamic shuffle version, while keeping the door open for a two-input dynamic shuffle version in the future. The one-input version maps in a straightforward way to most hardware, which makes it uncontroversial, and delivers instant value. |
I've just done the merge from master. Tell me if everything is correct now. |
So this LGTM. Optionally it could also modify the binary encoding to include these, but maybe it is better to wait for @dtig's review. |
I think the discussion here is missing justification for why these instructions should be added. We need to know specific classes of applications that would benefit and we need to have tentative performance and/or code size numbers that would show that including these instructions would be worth the extra implementation effort. In particular, the variable-index permute seems very complex and non-portable based on discussions above. WebAssembly instructions are meant to map very simply to the underlying native instructions so that baseline compilers in engines do not need to do much work, so we actively do not want abstractions in the instruction set. What specific use cases do variable-index permutes address? Are there more portable formulations of it that would still be useful? |
@tlively As one example, see comment #24 (comment). The algorithm implemented there needs a dynamic (run-time index) permute. |
Would it be possible to get numbers that are specific to |
There are three different things being proposed in this issue.

Encoding of shuffle immediates

@lemaitre has already hinted that it makes sense to them to split that discussion, and @dtig has also commented that they would find that useful, so let's assume that this will happen.

More shuffle instructions with immediate indices

These are useful, e.g., for adding all the lanes of a vector (horizontal reductions). The question is IMO whether they are worth it. The two pros I can think of here are that they reduce binary size, and that they might simplify the WASM machine code generator, because instead of having to recognize that a particular byte-shuffle pattern is really a wider-element shuffle, the element size is carried by the instruction itself. I have no idea whether this is the case, but maybe @sunfishcode can chime in and comment about how easy it is for Cranelift to generate optimal code for these.

Dynamic shuffles

The question here seems to be whether this is useful / whether it allows writing faster programs. The classical Fannkuch Redux benchmark (benchmarks game and paper) shows that the two fastest implementations in C and C++ (1.5x faster than the 3rd place) actually use this operation explicitly via an intrinsic; in Rust portable packed SIMD we expose it as a dynamic shuffle as well (see the sketch below).

EDIT: so IMO |
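For reference, a minimal sketch (my own, with an illustrative function name) of how a one-input dynamic v8x16 shuffle maps to a single instruction on the two most common targets, assuming all indices are in the range 0..15 (out-of-range behavior differs between PSHUFB and TBL):

```c
#if defined(__SSSE3__)
#include <tmmintrin.h>
__m128i shuffle1_dyn_v8x16(__m128i v, __m128i idx) {
    return _mm_shuffle_epi8(v, idx);   // PSHUFB: one instruction
}
#elif defined(__aarch64__)
#include <arm_neon.h>
uint8x16_t shuffle1_dyn_v8x16(uint8x16_t v, uint8x16_t idx) {
    return vqtbl1q_u8(v, idx);         // TBL with one table register: one instruction
}
#endif
```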
Maybe it might be worth it to split this into those three issues, so that we can discuss and resolve them independently of each other, and so that reaching consensus on one of the issues is not blocked by the other two. |
Also, I just recalled another application of dynamic shuffles. Basically, this instruction is useful whenever you need to perform a table lookup given some run-time indices. This happens often when encoding / decoding anything that is even mildly complex (UTF-8, DNA, etc.).
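A concrete (hypothetical) instance of such a table lookup on SSSE3 — the function name is made up for illustration: mapping 16 nibble values to their ASCII hex digits with one dynamic byte shuffle.

```c
#include <tmmintrin.h>

// Sketch (SSSE3): encode 16 nibbles (values 0..15, one per byte) as ASCII hex
// digits using a dynamic byte shuffle as a 16-entry table lookup.
// Assumes every byte of `nibbles` is already in the range 0..15.
static __m128i nibbles_to_hex(__m128i nibbles) {
    const __m128i table = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                        '8','9','a','b','c','d','e','f');
    return _mm_shuffle_epi8(table, nibbles);  // result[i] = table[nibbles[i]]
}
```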
|
@gnzlbg, thank you for summarizing (and thank you @lemaitre for bringing this up). Those are significant, we should try to reach some conclusion on all three. Does either one of you want to post this as three separate proposals (issues)? I can do it myself tomorrow, but I think you can probably explain it better than I do. |
Added more elaborate benchmark numbers targeting specifically the lack of dynamic permute: #24 (comment) |
As it seems the community has reached a consensus on other PRs/issues, I updated my PR to take into account all the changes. |
This PR would also help with targeting ARM SVE: the new instructions can map onto one of the permutes with immediate (DUP, EXT, INSR, REV, REVB, REVH, REVW, SUNPKHI, SUNPKLO, TRN1, TRN2, UUNPKHI, UUNPKLO, UZP1, UZP2, ZIP1, ZIP2) or copies/broadcasts (CPY, DUP, FCPY, FDUP, SEL), and the rest can be emulated using a permute with an extra vector or predicate register (COMPACT, SPLICE or TBL). See "ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A" https://static.docs.arm.com/ddi0584/ae/DDI0584A_e_SVE_supp_armv8A.pdf and https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_00_en.pdf for details. |
Closing since we have consensus on our current strategy for shuffles. |
This is a PR for shuffling instructions with immediate indices.
It aims to add back the instructions v16x8.shuffle2_imm, v32x4.shuffle2_imm, v64x2.shuffle2_imm into WebAssembly. These instructions enable better and simpler pattern matching in the WASM->ASM virtual machine to ensure the best performance.
Indeed, while v8x16.shuffle_imm2 can be used to emulate all the others, it is tedious to recognize the shuffling rules in order to use the proper instruction of the target platform.

EDITED: Old PR description
Hello everyone,
I created a small specification for shuffle and permute operations.
My intent with this merge request is mainly to have a discussion.
A few general questions:
More specialized questions: is permute required to avoid input duplication?
Any comment on this is welcome.