-
Notifications
You must be signed in to change notification settings - Fork 0
Permutations, Shuffles, Packs, etc.
There are many instructions that rearrange data. In general, we can think of this as rearranging vector elements of POT (Power of Two) sizes. Where the vectors might be RISC-V-like, or where they may be SIMD packed vectors, with N bits divided into NAPOT (Naturally Aligned Power of Two) bitfields, where in a separate SIMF RF, or whether interpreted n-buit scalars as ASIMD bitvectors.
In general
rd.n[i] := rs1.n[ permutation_vector[i] ], i=0..XLEN/n - 1
or the version allows permutations across 2 operands, e,.g. to handled packed data structures that straddle boundaries
rd.n[i] := concat(rs1,rs2).n[ permutation_vector[i] ], i=0..XLEN/n - 1
but the permutation vector elements must be of size log2(XLEN/n) (+1 if straddle 2 inoputs)
this is fine for a RISC-V-like ISA where the vector operands should have the same number of elements but not the same element width (although at the moment RISC-V doers not have many mixed width input operands AFAIK)
but it does not work so well for a SIMD packed vector ISA, where you want the same number of bits in all operands
rd.n[i] := rs1.n[ rs2.n[i] ], i=0..XLEN/n - 1
Moreover, even the generalized permutation for restricted inoput widths is expensive. Many uarchs do not want to implement.
An even bigger generalization are BMM - bit matrix multiplies, whether BMMM (bit matrix multiply), or smaller BMVM (bit matrix matrix multiply) (I will looselty say that BMM = BMVM and BMMM) E.g. with AND as the multipliation operator, and OR as the addition operator).
BMMs are not so much a generation of permutations, as it is an operation of wider utility - BMM-AND-XOR is used in many information theoretic analyses, while BMM-XOR-POPCOUNT can be considered a as computing a vector of Huffman differences, used in pattern matching. Either multiple patterns against a single input, or multiple inputs against a single pattern. (Eventually a min/max reduction will be be neededm although often there will be more BMM-XOR-POPCOUNTs with vector-vector ADDs, for patterns and inpurts larger than the dimension N(N-bit veftors, NxN bit matrices).
However, BMMs, specifically BMVM (BMVM-AND-OR or BMVM-XOR-OR or BMVM-XOR-XOR), can be used to perform an arbitrary permutation of the bits of the vector operand, assuming only 1 bitin every column contributing to any bit of the output (Q: what's tghe math term for that?). Assuming, of course, that if the vector is n-bits wide, the matrix is NxB nits wide. Unfortunately, this is not practical for scalar instruction sets - for XLEN=64, it would require 64 registers to hold the bit matrix.
Typically then a full BMM is divided into smaller parts. E.g. a BMVM matrix-vector, for a vector whose elements are N-64 bits, would need 64x64 = 2^12 = 4K bits of "vector" considered as the matrix to perform BMVM on each veector element - arbitrary bit matrix, but the same bit matrix for every 64-bit vector element. This would set a minimum width on the vector (VLEN=4K in RISC-V parlance, or the same width obtaine via a smaller VLEN with LMUL>1 (length multiplier)).
Cray's old vector machines implemwented such a BMM, IIRC with a 64x64 operand. The BMM operand was stored in a specia functional unit, hopefully not changed very often.
Cray's newer vector machines, dertived from the Tera Horizon, have even smaller BMM units - e.g. treatibng both operands as vectors of 8x8 bit matrixes. Larger BMMs can be composed out of thiese.
RISC-V's B-extension has done something similar - it's RV64 bmat[x]or instructions treat the operands as 8x8 bit matrices.
8x8 BMMs can be considered as doing an arbitraty permutatation of 8-bit bytes with a 64-but operad, or the same arbitrary permition of bits within each byte of the 64-bit operand.
Moreover, even the generalized permutation for restricted inoput widths is expensive. Many uarchs do not want to implement.
So there has evolved a cottage industry of defining certain restricted permutations, and building generalized permutations out of them:
perfect shuffles, generalized reversals, etc.
TBD: collect
TBD:
Crypto S-boxes
PRESENT's pLayer permutation - this is what inspired starting this page