Skip to content

Permutations, Shuffles, Packs, etc.

AndyGlew edited this page Sep 10, 2020 · 1 revision

Wanted: arbitrary permutations of NAPOT data

There are many instructions that rearrange data. In general, we can think of this as rearranging vector elements of POT (Power of Two) sizes. Where the vectors might be RISC-V-like, or where they may be SIMD packed vectors, with N bits divided into NAPOT (Naturally Aligned Power of Two) bitfields, where in a separate SIMF RF, or whether interpreted n-buit scalars as ASIMD bitvectors.

In general

rd.n[i] := rs1.n[ permutation_vector[i] ],  i=0..XLEN/n - 1

or the version allows permutations across 2 operands, e,.g. to handled packed data structures that straddle boundaries

rd.n[i] := concat(rs1,rs2).n[ permutation_vector[i] ],  i=0..XLEN/n - 1

but the permutation vector elements must be of size log2(XLEN/n) (+1 if straddle 2 inoputs)

this is fine for a RISC-V-like ISA where the vector operands should have the same number of elements but not the same element width (although at the moment RISC-V doers not have many mixed width input operands AFAIK)

but it does not work so well for a SIMD packed vector ISA, where you want the same number of bits in all operands

rd.n[i] := rs1.n[ rs2.n[i] ],  i=0..XLEN/n - 1

Moreover, even the generalized permutation for restricted inoput widths is expensive. Many uarchs do not want to implement.

Bit Matrix Multiplications

An even bigger generalization are BMM - bit matrix multiplies, whether BMMM (bit matrix multiply), or smaller BMVM (bit matrix matrix multiply) (I will looselty say that BMM = BMVM and BMMM) E.g. with AND as the multipliation operator, and OR as the addition operator).

BMMs are not so much a generation of permutations, as it is an operation of wider utility - BMM-AND-XOR is used in many information theoretic analyses, while BMM-XOR-POPCOUNT can be considered a as computing a vector of Huffman differences, used in pattern matching. Either multiple patterns against a single input, or multiple inputs against a single pattern. (Eventually a min/max reduction will be be neededm although often there will be more BMM-XOR-POPCOUNTs with vector-vector ADDs, for patterns and inpurts larger than the dimension N(N-bit veftors, NxN bit matrices).

However, BMMs, specifically BMVM (BMVM-AND-OR or BMVM-XOR-OR or BMVM-XOR-XOR), can be used to perform an arbitrary permutation of the bits of the vector operand, assuming only 1 bitin every column contributing to any bit of the output (Q: what's tghe math term for that?). Assuming, of course, that if the vector is n-bits wide, the matrix is NxB nits wide. Unfortunately, this is not practical for scalar instruction sets - for XLEN=64, it would require 64 registers to hold the bit matrix.

Typically then a full BMM is divided into smaller parts. E.g. a BMVM matrix-vector, for a vector whose elements are N-64 bits, would need 64x64 = 2^12 = 4K bits of "vector" considered as the matrix to perform BMVM on each veector element - arbitrary bit matrix, but the same bit matrix for every 64-bit vector element. This would set a minimum width on the vector (VLEN=4K in RISC-V parlance, or the same width obtaine via a smaller VLEN with LMUL>1 (length multiplier)).

Cray's old vector machines implemwented such a BMM, IIRC with a 64x64 operand. The BMM operand was stored in a specia functional unit, hopefully not changed very often.

Cray's newer vector machines, dertived from the Tera Horizon, have even smaller BMM units - e.g. treatibng both operands as vectors of 8x8 bit matrixes. Larger BMMs can be composed out of thiese.

RISC-V's B-extension has done something similar - it's RV64 bmat[x]or instructions treat the operands as 8x8 bit matrices.

8x8 BMMs can be considered as doing an arbitraty permutatation of 8-bit bytes with a 64-but operad, or the same arbitrary permition of bits within each byte of the 64-bit operand.

Restricted permutations

Moreover, even the generalized permutation for restricted inoput widths is expensive. Many uarchs do not want to implement.

So there has evolved a cottage industry of defining certain restricted permutations, and building generalized permutations out of them:

perfect shuffles, generalized reversals, etc.

TBD: collect

Paericularly important permutations

TBD:

Crypto S-boxes

PRESENT's pLayer permutation - this is what inspired starting this page

Clone this wiki locally