vector mixed width operations and lane crossing

AndyGlew edited this page Aug 3, 2020 · 1 revision

The problem with 2X widening / width-doubling operations of the form

vd.2N[i] := f( vs1.N[i] )

e.g.

vd.128[i] := vs1.64[i] * vs2.64[i]

is that it often leads to crossing physical lanes. 64 bits is a pretty common physical lane width, although I was very happy to have 128 at MIPS.

But it's not just that. If you think about a single physical register

preg.128[i] = preg[ 128i : 128i + 127 ]

preg.64[i] = preg[ 64i : 64i + 63 ]

and these values are not in the same 128-bit lane when i >= 1 !!! :-(
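A quick sketch of the index arithmetic above (the `lane_of` helper is mine, not from any spec) shows that the N-bit source element and the 2N-bit destination element diverge in physical lane for every i >= 1:

```python
# Which 128-bit physical lane holds element i of a given element width?
LANE_BITS = 128

def lane_of(elem_width_bits, i):
    """Physical 128-bit lane containing element i of the given width."""
    return (i * elem_width_bits) // LANE_BITS

N = 64
for i in range(4):
    src_lane = lane_of(N, i)        # lane of vs1.N[i]
    dst_lane = lane_of(2 * N, i)    # lane of vd.2N[i]
    print(i, src_lane, dst_lane, src_lane == dst_lane)
# only i = 0 stays in-lane; i = 1, 2, 3 all cross
```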

(BTW, this is the SIMD mindset, which IMHO is much more naturally suited to a data path than the RISC-V approach, if the latter is naïvely implemented.)

Moreover, for conventional vector instruction sets with a fixed number of bits per register, the widened result takes twice as many registers,

which leads to things like ARM's lower and upper widening instructions:

LOWER HALF: vd.2N[i] := f( vs1.N[i] ), i := 0..VLEN/2N-1

UPPER HALF: vd.2N[i] := f( vs1.N[i + VLEN/2N] ), i := 0..VLEN/2N-1

This still has lane crossing problems, but at least it only writes one register. RISC-V hides this problem under the covers of LMUL.
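To see why the lower/upper split still crosses lanes, here is a sketch (my own helper names, with an assumed VLEN=256, N=64, 128-bit lanes) mapping each destination element to its source element's lane:

```python
# Lower/upper-half widening: source element index for destination element i.
VLEN, N, LANE_BITS = 256, 64, 128
half = VLEN // (2 * N)   # number of 2N-bit destination elements per half

def src_index(i, upper):
    """N-bit source element feeding 2N-bit destination element i."""
    return i + (half if upper else 0)

for upper in (False, True):
    for i in range(half):
        s = src_index(i, upper)
        src_lane = (s * N) // LANE_BITS        # lane of vs1.N[s]
        dst_lane = (i * 2 * N) // LANE_BITS    # lane of vd.2N[i]
        print("upper" if upper else "lower", i, src_lane, dst_lane)
# only the lower half's element 0 has src_lane == dst_lane; the rest cross
```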

If you want to avoid lane crossings, you really need to do something like

EVEN HALF: vd.2N[i] := f( vs1.N[2*i] ), i := 0..VLEN/2N-1

ODD HALF: vd.2N[i] := f( vs1.N[2*i+1] ), i := 0..VLEN/2N-1
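A sketch confirming the claim (helper names are mine): with even/odd source elements feeding destination element i, source and destination always land in the same 128-bit lane, for any N that packs a whole number of 2N-bit elements per lane:

```python
# Even/odd-element widening: source and destination lanes always match.
N, LANE_BITS = 64, 128

def lanes(i):
    even_src = (2 * i * N) // LANE_BITS        # lane of vs1.N[2i]
    odd_src = ((2 * i + 1) * N) // LANE_BITS   # lane of vs1.N[2i+1]
    dst = (i * 2 * N) // LANE_BITS             # lane of vd.2N[i]
    return even_src, odd_src, dst

for i in range(8):
    e, o, d = lanes(i)
    assert e == o == d   # never crosses a lane boundary
```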

and it is better still if you can use all of the data in a single pass:

MULADD.EE+OO: vd.2N[i] += vs1.N[2i] * vs2.N[2i] + vs1.N[2i+1] * vs2.N[2i+1]

and also

MULADD.EO+OE: vd.2N[i] += vs1.N[2i] * vs2.N[2i+1] + vs1.N[2i+1] * vs2.N[2i]
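A reference-model sketch of these two forms (semantics as read from the pseudocode above; the function names are mine, and the example treats vd, vs1, vs2 as plain lists of 2N-bit and N-bit elements):

```python
# MULADD.EE+OO: vd.2N[i] += vs1.N[2i]*vs2.N[2i] + vs1.N[2i+1]*vs2.N[2i+1]
def muladd_ee_oo(vd, vs1, vs2):
    for i in range(len(vd)):
        vd[i] += vs1[2*i] * vs2[2*i] + vs1[2*i+1] * vs2[2*i+1]
    return vd

# MULADD.EO+OE: vd.2N[i] += vs1.N[2i]*vs2.N[2i+1] + vs1.N[2i+1]*vs2.N[2i]
def muladd_eo_oe(vd, vs1, vs2):
    for i in range(len(vd)):
        vd[i] += vs1[2*i] * vs2[2*i+1] + vs1[2*i+1] * vs2[2*i]
    return vd

vs1, vs2 = [1, 2, 3, 4], [5, 6, 7, 8]
print(muladd_ee_oo([0, 0], vs1, vs2))   # [1*5+2*6, 3*7+4*8] = [17, 53]
print(muladd_eo_oe([0, 0], vs1, vs2))   # [1*6+2*5, 3*8+4*7] = [16, 52]
```

Between them the two forms consume every even/odd cross product, so each source element is read exactly twice and every read stays within its own lane pair.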

Similarly for 4X widening, as you might use for multiply-add if you weren't doing CLMUL or redundant form for extended precision.

People don't necessarily like this sort of interleaved odd/even or quadrant-based operation for widening; they'd much rather have lane crossing. But it makes a really big difference to hardware complexity for spatially parallel vector data paths. The lower/upper half widening versions are appropriate for a temporal vector data path, but I don't think anybody has built a strictly temporal data path since the Cray-1. Nearly everybody has some spatial aspect to the data path, and that spatial aspect is nearly always a fixed number of bits, VLEN. Nvidia GPUs currently have temporal factor = 2: VLEN = 16 threads/warp * 32 bits/thread * 1 warp/2 cycles = 512 bits. Older AMD GPUs had temporal factor = 4 cycles/wavefront.

I try to keep a few target implementations in my head for RISC-V.

The ones I mostly care about:

VLEN=128 bits, lane size = 128

VLEN=512, lane size = 128

Lower end: VLEN=128, lane = 32

Really low end (not clear you really want a vector instruction set here): VLEN=32, lane=32
