From 7f4d54d4c9a525f6e04c7edcecdc1969f27e5d09 Mon Sep 17 00:00:00 2001
From: Arseny Kapoulkine <arseny.kapoulkine@gmail.com>
Date: Wed, 27 Mar 2019 10:10:01 -0700
Subject: [PATCH] Add v8x16.shuffle1 instruction (#71)

This change adds a variable shuffle instruction to the SIMD proposal.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISC-V architectures.

On x86_64 and MIPS, the hardware provides instructions that select 0
when the high bit of the index is set (x86_64) or when either of the two
high bits is set (MIPS). On these architectures, the backend is expected
to emit a pair of instructions, a saturating add (saturate(x + (128 - 16))
for x86_64) followed by a permute, to emulate the proposed behavior.
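
As a rough illustration of why the saturating add is enough on x86_64,
here is a minimal Python model. It assumes the instruction pair is
PADDUSB followed by PSHUFB, which this message does not name; the
function names are hypothetical and only model lane behavior.

```python
LANES = 16  # v8x16: sixteen 8-bit lanes


def shuffle1(a, s):
    """Proposed semantics: out-of-range indices produce 0."""
    return [a[i] if i < LANES else 0 for i in s]


def saturating_add_u8(x, k):
    """Unsigned saturating byte add (assumed to model PADDUSB)."""
    return min(x + k, 255)


def pshufb(a, s):
    """Assumed model of PSHUFB: 0 if bit 7 of the index is set,
    otherwise the lane selected by the low 4 bits."""
    return [0 if i & 0x80 else a[i & 0x0F] for i in s]


def shuffle1_x86(a, s):
    """Emulation sketch: bias indices by 128 - 16, then permute."""
    # In-range indices 0..15 become 112..127: bit 7 stays clear and the
    # low 4 bits are unchanged. Indices 16 and above end up with bit 7
    # set; saturation stops large values from wrapping back into range.
    adjusted = [saturating_add_u8(i, 128 - 16) for i in s]
    return pshufb(a, adjusted)
```

A wrapping add would not work here: index 144 would wrap to 0 and
select lane 0 instead of producing 0.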

To distinguish variable shuffles from immediate shuffles, the existing
v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be
explicit about the fact that it shuffles two vectors with an immediate
argument.

This naming scheme allows for adding variants like v8x16.shuffle2 and
v8x16.shuffle1_imm in the future.

Fixes #68.
Contributes to #24.
Fixes #11.
---
 proposals/simd/BinarySIMD.md |  5 +++--
 proposals/simd/SIMD.md       | 25 ++++++++++++++++++++++---
 proposals/simd/TextSIMD.md   |  4 ++--
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/proposals/simd/BinarySIMD.md b/proposals/simd/BinarySIMD.md
index 08dadda339..ff51a7f0f1 100644
--- a/proposals/simd/BinarySIMD.md
+++ b/proposals/simd/BinarySIMD.md
@@ -23,14 +23,13 @@ instr ::= ...
 ```
 
 Some SIMD instructions have additional immediate operands following `simdop`.
-The `v8x16.shuffle` instruction has 16 bytes after `simdop`.
+The `v8x16.shuffle2_imm` instruction has 16 bytes after `simdop`.
 
 | Instruction               | `simdop` | Immediate operands |
 | --------------------------|---------:|--------------------|
 | `v128.load`               |    `0x00`| m:memarg           |
 | `v128.store`              |    `0x01`| m:memarg           |
 | `v128.const`              |    `0x02`| i:ImmByte[16]      |
-| `v8x16.shuffle`           |    `0x03`| s:LaneIdx32[16]    |
 | `i8x16.splat`             |    `0x04`| -                  |
 | `i8x16.extract_lane_s`    |    `0x05`| i:LaneIdx16        |
 | `i8x16.extract_lane_u`    |    `0x06`| i:LaneIdx16        |
@@ -167,3 +166,5 @@ The `v8x16.shuffle` instruction has 16 bytes after `simdop`.
 | `f32x4.convert_u/i32x4`   |    `0xb0`| -                  |
 | `f64x2.convert_s/i64x2`   |    `0xb1`| -                  |
 | `f64x2.convert_u/i64x2`   |    `0xb2`| -                  |
+| `v8x16.shuffle1`          |    `0xc0`| -                  |
+| `v8x16.shuffle2_imm`      |    `0xc1`| s:LaneIdx32[16]    |
\ No newline at end of file
diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index bb8d1d4a75..4862364e8e 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -284,8 +284,8 @@ def S.replace_lane(a, i, x):
 The input lane value, `x`, is interpreted the same way as for the splat
 instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 
-### Shuffle lanes
-* `v8x16.shuffle(a: v128, b: v128, imm: ImmLaneIdx32[16]) -> v128`
+### Shuffling using immediate indices
+* `v8x16.shuffle2_imm(a: v128, b: v128, imm: ImmLaneIdx32[16]) -> v128`
 
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 16 byte wide immediate mode operand `imm`. This
@@ -294,7 +294,7 @@ return. The indices `i` in range `[0, 15]` select the `i`-th element of `a`. The
 indices in range `[16, 31]` select the `i - 16`-th element of `b`.
 
 ```python
-def S.shuffle(a, b, s):
+def S.shuffle2_imm(a, b, s):
     result = S.New()
     for i in range(S.Lanes):
         if s[i] < S.lanes:
@@ -304,6 +304,25 @@ def S.shuffle(a, b, s):
     return result
 ```
 
+### Shuffling using variable indices
+* `v8x16.shuffle1(a: v128, s: v128) -> v128`
+
+Returns a new vector with lanes selected from the lanes of the first input
+vector `a` specified in the second input vector `s`. The indices `i` in range
+`[0, 15]` select the `i`-th element of `a`. For indices outside of the range
+the resulting lane is 0.
+
+```python
+def S.shuffle1(a, s):
+    result = S.New()
+    for i in range(S.Lanes):
+        if s[i] < S.Lanes:
+            result[i] = a[s[i]]
+        else:
+            result[i] = 0
+    return result
+```
+
 ## Integer arithmetic
 
 Wrapping integer arithmetic discards the high bits of the result.
diff --git a/proposals/simd/TextSIMD.md b/proposals/simd/TextSIMD.md
index 8ba2e4a7b6..fc3a7e7d2c 100644
--- a/proposals/simd/TextSIMD.md
+++ b/proposals/simd/TextSIMD.md
@@ -20,8 +20,8 @@ The canonical text format used for printing `v128.const` instructions is
 v128.const i32x4 0xNNNNNNNN 0xNNNNNNNN 0xNNNNNNNN 0xNNNNNNNN
 ```
 
-### v8x16.shuffle
+### v8x16.shuffle2_imm
 
 ```
-v8x16.shuffle i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5
+v8x16.shuffle2_imm i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5
 ```