This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Shuffle with immediate indices specification #30

Closed
wants to merge 11 commits into from
36 changes: 35 additions & 1 deletion proposals/simd/SIMD.md
@@ -211,10 +211,14 @@ The input lane value, `x`, is interpreted the same way as for the splat
instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.

### Shuffle lanes

#### Immediate permutation rule
* `v8x16.shuffle(a: v128, b: v128, s: LaneIdx32[16]) -> v128`
* `v16x8.shuffle(a: v128, b: v128, s: LaneIdx16[8]) -> v128`
* `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128`
* `v64x2.shuffle(a: v128, b: v128, s: LaneIdx4[2]) -> v128`

Create a vector with lanes selected from the lanes of the two input vectors:

```python
def S.shuffle(a, b, s):
    result = S.New()
    # @@ -226,6 +230,11 @@ (lines collapsed in the diff view)
    return result
```
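The selection rule can be exercised with a small runnable model. This is only a sketch, not the specification text: it assumes that indices below the lane count select from `a` and the remaining indices select from `b`, and it models a vector as a plain Python list. The function name `shuffle` and the list representation are illustrative choices.

```python
def shuffle(a, b, s):
    """Model of an immediate shuffle on two equal-length lane lists.

    Assumption: an index i < lanes picks a[i]; otherwise it picks b[i - lanes].
    """
    lanes = len(a)
    if len(b) != lanes or len(s) != lanes:
        raise ValueError("a, b and s must have the same number of lanes")
    result = []
    for idx in s:
        # Immediates are validated up front, mirroring the LaneIdx ranges.
        if not 0 <= idx < 2 * lanes:
            raise ValueError(f"lane index {idx} out of range")
        result.append(a[idx] if idx < lanes else b[idx - lanes])
    return result


# v32x4-style example: interleave the low halves of a and b.
print(shuffle([0, 1, 2, 3], [10, 11, 12, 13], [0, 4, 1, 5]))
```

Passing the same vector as both `a` and `b` turns the model into a single-input permutation.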

#### Variable permutation rule
* `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`

Same as the immediate (non-`Var`) variant, except that the lane indices are runtime values taken from the lanes of `s` instead of immediates.
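A possible model of the variable form is sketched below. The text does not say what happens when a runtime index falls outside `0..31`; reducing the index modulo 32 is purely an assumption made here so that the sketch is total, and `shuffle_var` is a hypothetical helper name.

```python
def shuffle_var(a, b, s):
    """Model of v8x16.shuffleVar on 16-lane byte lists.

    Assumption (not stated in the text): each index byte is reduced
    modulo 32 so that every runtime value selects a valid lane.
    """
    lanes = 16
    if not (len(a) == len(b) == len(s) == lanes):
        raise ValueError("a, b and s must each hold 16 byte lanes")
    both = list(a) + list(b)  # lanes 0..15 come from a, 16..31 from b
    return [both[idx % (2 * lanes)] for idx in s]


a = list(range(16))                # lanes 0..15
b = list(range(100, 116))          # lanes 100..115
indices = list(range(31, 15, -1))  # runtime data selecting b in reverse order
print(shuffle_var(a, b, indices))
```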

## Integer arithmetic

Wrapping integer arithmetic discards the high bits of the result.
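As an illustration of what discarding the high bits means, the following sketch masks each lane result to the lane width. The helper names `wrapping_add` and `i8x16_add` are made up for this example, and lanes are modelled as unsigned Python integers.

```python
def wrapping_add(x, y, bits):
    """Add two lane values and keep only the low `bits` bits of the sum."""
    return (x + y) & ((1 << bits) - 1)


def i8x16_add(a, b):
    """Lane-wise wrapping add on two lists of 8-bit lane values."""
    return [wrapping_add(x, y, 8) for x, y in zip(a, b)]


# 250 + 10 = 260 = 0x104; only the low 8 bits (0x04) survive.
print(wrapping_add(250, 10, 8))
```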
@@ -675,3 +684,28 @@ Lane-wise saturating conversion from floating point to integer using the IEEE
resulting lane is 0. If the rounded integer value of a lane is outside the
range of the destination type, the result is saturated to the nearest
representable integer value.
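The saturation rule can be sketched for a single lane as follows. Part of the paragraph is collapsed in the diff view, so two points are assumptions here: that the zero result applies to NaN inputs, and that rounding is toward zero. The name `trunc_sat` and its parameters are illustrative.

```python
import math


def trunc_sat(x, bits=32, signed=True):
    """Convert one floating-point lane to an integer with saturation.

    Assumptions: NaN maps to 0 and rounding is toward zero.
    """
    if math.isnan(x):
        return 0
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    # Clamp first so that infinities never reach math.trunc.
    if x >= hi:
        return hi
    if x <= lo:
        return lo
    return math.trunc(x)


print([trunc_sat(v) for v in (1.9, -1.9, 1e20, float("-inf"), float("nan"))])
```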


## Reductions

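There is no dedicated reduction instruction.
Instead, a reduction can be built from permutations combined with a lane-wise operation such as `add`, `min`, `max`, `and`, or `or`: each step combines the vector with a permuted copy of itself, so a vector of `n` lanes is reduced in `log2(n)` steps.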

Here is an example of an `add` reduction on `f32x4`:
```
get_local 0
get_local 0
v64x2.permute 1 0      ;; swap the lower half of the vector with the upper half
f32x4.add              ;; lanes 0 and 1 now hold the two pairwise sums
tee_local 0            ;; keep the partial sums (overwrites local 0)
get_local 0
v32x4.permute 1 0 3 2  ;; swap the two lanes within each half
f32x4.add              ;; every lane now holds the full sum
f32x4.extract_lane 0   ;; extract the first lane
```
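The two permute-and-add steps can be checked with a small Python model. It is a sketch under stated assumptions: `permute` is taken to be a single-input operation with `result[i] = v[idx[i]]`, each step adds the running vector to a permuted copy of itself, and vectors are plain lists. All helper names are illustrative.

```python
def permute(v, idx):
    """Single-input lane permutation (assumed): result[i] = v[idx[i]]."""
    return [v[i] for i in idx]


def lane_add(a, b):
    """Lane-wise addition of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def reduce_add_f32x4(v):
    # v64x2.permute 1 0, viewed on four 32-bit lanes, is [2, 3, 0, 1].
    v = lane_add(v, permute(v, [2, 3, 0, 1]))
    # v32x4.permute 1 0 3 2 swaps the two lanes within each half.
    v = lane_add(v, permute(v, [1, 0, 3, 2]))
    # f32x4.extract_lane 0
    return v[0]


print(reduce_add_f32x4([1.0, 2.0, 3.0, 4.0]))
```

After the second step every lane holds the same total, so any lane could be extracted.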

Here is an example of an `add` reduction on `f64x2`:
```
get_local 0
get_local 0
v64x2.permute 1 0     ;; swap the lower half of the vector with the upper half
f64x2.add             ;; both lanes now hold the full sum
f64x2.extract_lane 0  ;; extract the first lane
```
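The same pattern generalizes to other lane-wise operations and lane counts. The sketch below is one way to express it, assuming a power-of-two lane count and an associative, commutative operation; `reduce_lanes` is a hypothetical helper, and the index expression `i ^ step` stands in for the permutation that exchanges blocks of `step` lanes.

```python
import operator


def reduce_lanes(v, op):
    """Reduce a power-of-two-length vector with permute-and-combine steps."""
    lanes = len(v)
    if lanes == 0 or lanes & (lanes - 1):
        raise ValueError("lane count must be a power of two")
    step = lanes // 2
    while step >= 1:
        # Permutation that exchanges neighbouring blocks of `step` lanes.
        swapped = [v[i ^ step] for i in range(lanes)]
        v = [op(x, y) for x, y in zip(v, swapped)]
        step //= 2
    return v[0]  # every lane holds the result; extract lane 0


print(reduce_lanes([1.5, 2.5], operator.add))  # f64x2-style add
print(reduce_lanes([3, 9, 1, 7], min))         # i32x4-style min
```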