
[RFC] Dynamic shuffle #68

Closed
penzn opened this issue Mar 7, 2019 · 13 comments
Comments

@penzn
Contributor

penzn commented Mar 7, 2019

Background in #30 by @lemaitre, also part of #24.

I would like to narrow this issue down to just the i8x16.permute_dyn instruction proposed in #30, which would shuffle the contents of one vector using the lanes of a second vector as indices.

@zeux and @gnzlbg point out that a dynamic permute is needed for bytewise processing on structures of dynamic width. Major hardware vendors support some form of dynamic shuffle.

From proposed changes in #30:

  • v8x16.permute_dyn(a: v128, s: v128) -> v128

Returns a new vector with lanes selected from the lanes of the first input vector
a, using the lanes of the second input vector s as indices. The indices from s
are first brought into the range [0, 15] via a modulo.

def S.permute_dyn(a, s):
    result = S.New()
    for i in range(S.Lanes):
      result[i] = a[s[i] % S.Lanes]
    return result
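
For concreteness, a small worked example of the modulo semantics above; this is plain Python written for illustration and is not part of the proposal:

def permute_dyn_mod16(a, s):
    # Current proposal: indices are reduced modulo the lane count (16).
    return [a[s[i] % 16] for i in range(16)]

a = list(range(100, 116))               # lane i holds the value 100 + i
s = [0, 1, 15, 16, 17, 255] + [0] * 10  # after modulo: 16 -> 0, 17 -> 1, 255 -> 15
print(permute_dyn_mod16(a, s)[:6])      # [100, 101, 115, 100, 101, 115]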
@gnzlbg
Contributor

gnzlbg commented Mar 7, 2019

To the best of my knowledge, the following hardware natively supports this operation as a single instruction (listed as arch: ISA (instruction)):

  • x86_64: SSSE3 (pshufb), AVX2 (vpshufb), AVX-512 (vpshufb)
  • arm: NEON ({v}tb{l,x})
  • ppc64: altivec (vperm), vsx (xxperm)
  • mips: msa (vshf)
  • riscv: V (vrgather)

@tlively
Member

tlively commented Mar 7, 2019

From the previous conversation it seems to me that this would be easy to implement both in engines and in tooling, and that it can lead to clear performance wins for some applications. I personally see no problem with including this in the MVP.

@zeux
Contributor

zeux commented Mar 8, 2019

Just in case, I want to address the possible options for handling out-of-range indices.

Here's how each architecture handles out-of-range indices in the selector (a rough model of the x86_64 and ARM rules follows the list):

  • x86_64: if bit 7 is 1, set the lane to 0; otherwise use the lower 4 bits for the lookup
  • ARM: if the index is out of range (0-15), set the lane to 0
  • PPC64: use the lower 4 bits for the lookup (no out-of-range handling)
  • MIPS: if bit 6 or bit 7 is 1, set the lane to 0; otherwise use the lower 4 bits for the lookup (or rather, the lower 5 bits index a 32-element table constructed from the 2 input vectors, but if both vectors are the same this effectively means bits 4 and 5 are ignored)
  • RISCV: if the index is out of range (0-15), set the lane to 0
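
To make the two extremes concrete, here is a rough Python model of the x86_64 and ARM selector rules described above (illustrative only; selector bytes are assumed to be unsigned values 0-255):

def pshufb(a, s):
    # x86_64 SSSE3 rule: if bit 7 of the selector byte is set, the lane becomes 0;
    # otherwise only the low 4 bits are used for the lookup.
    return [0 if s[i] & 0x80 else a[s[i] & 0x0F] for i in range(16)]

def neon_tbl(a, s):
    # ARM NEON rule: any selector outside 0..15 yields 0.
    return [a[s[i]] if s[i] < 16 else 0 for i in range(16)]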

The currently proposed specification requires all platforms except PPC64 to mask off the upper 4 bits with an and instruction, resulting in two instructions (if an and with an immediate/memory argument is available).

Alternatively, if the specification says that all out-of-range indices return 0, then ARM and RISCV will use the native instruction, x86_64/PPC64/MIPS will have to generate additional checks using a 3-instruction sequence like "compare with 15; shuffle; andnot".

Finally - and I'm not sure if this is OK to do in WebAssembly, but that's what C would do ;) - we could say that if the high bit (0x80) of the index is set we return 0, if the index is 0-15 we return the indexed lane, and otherwise the resulting byte is unspecified. This seems like the cleanest way to unify x86_64 behavior with the rest, resulting in a zero-cost implementation for x86_64, ARM, MIPS, and RISCV, and a 3-instruction sequence (compare with 0x80, shuffle, andnot) on PPC64.

I would be happy having the instruction with any of these behaviors - it's not free to work around out-of-range behavior in some cases, and with the proposed spec we'd be paying the cost of an extra and (1 cycle, assuming the load is free?) for each shuffle, but it's still substantially better than the status quo.

However, I want to make sure we carefully consider all possibilities. If unspecified behavior is acceptable, the third option is the cleanest: it saves 1 cycle in almost all cases and gives a cheap way to set elements to 0, which in some cases saves an extra instruction or two in the code that uses the shuffles. The cost is unspecified behavior for some index values, plus a penalty on PPC64, which doesn't seem more valuable than the other architectures.

If unspecified behavior is unacceptable then the behavior as specced seems best, since option 2 is zero-cost for ARM but more expensive for x86_64, whereas option 1 is more balanced.

@Maratyszcza
Contributor

Maratyszcza commented Mar 8, 2019

@zeux Actually, if the specification says that all out-of-range indices return 0, then x86_64 can implement it with just two instructions: an unsigned saturated add (PADDUSB) of byte 112 (= 128 - 16) followed by PSHUFB. Similarly for MIPS MSA, albeit with a different constant.
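
A quick way to see why this works, modeled in the same Python style as the spec pseudocode (illustrative, not normative): adding 112 with unsigned saturation keeps indices 0-15 below 128 with their low nibble intact, while any index of 16 or more ends up at 128 or above, which PSHUFB then zeroes.

def satadd_u8(x):
    return min(x + 112, 255)  # PADDUSB with the constant 112 = 128 - 16

def pshufb(a, s):
    return [0 if s[i] & 0x80 else a[s[i] & 0x0F] for i in range(16)]

def permute_dyn_zero(a, s):
    # "Out-of-range indices return 0" semantics via saturating add + PSHUFB.
    return pshufb(a, [satadd_u8(s[i]) for i in range(16)])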

The "out-of-range indices return 0" schema has an important advantage that it permits emulating larger than 16-byte lookup table. E.g. a 32 byte lookup can be simulated as permute_dyn(low16, idx) + permute_dyn(high16, idx - 16)

@zeux
Contributor

zeux commented Mar 8, 2019

Good point! Then it seems like the best option is to specify that out-of-range indices should return 0. This is zero cost on ARM and RISCV, and same cost as the currently proposed option on x86_64 and MIPS. PPC64 is the only platform where the currently proposed option is zero-cost but I don't think we should be optimizing for PPC64 over ARM...

@lemaitre

lemaitre commented Mar 8, 2019

When I wrote the specification of v8x16.permute_dyn in my PR, I knew about the ARM out-of-range policy, but I thought pshufb on x86 did not look at the upper bits (like what I proposed).
In fact, that is the case for all the other "permutevar" instructions on x86, hence the confusion.

I'm happy to change the specification to something more useful and better supported.

So I think the best behavior is the following:

def S.permute_dyn(a, s):
    result = S.New()
    for i in range(S.Lanes):
        idx = s[i]
        if idx & (1 << 7) != 0:
            result[i] = 0
        elif idx < S.Lanes:
            result[i] = a[idx]
        else:
            result[i] = unspecified_value()
    return result

Therefore, we get a perfect translation on all major platforms except PPC (where a couple of extra instructions would be required).

If we cannot afford to specify unspecified_value() (or undefined_behavior()), then we cannot have a perfect translation on both ARM and x86.
I tend to prefer unspecified_value() over undefined_behavior() as it is more constrained (e.g. it cannot format your filesystem) while still allowing perfect codegen whenever possible.

@gnzlbg
Contributor

gnzlbg commented Mar 8, 2019

I thought pshufb on x86 did not look at the upper bits (like what I proposed).

You are not alone, I also forgot about this detail. Thank you @zeux for bringing it to our attention.

In fact, that is the case for all the other "permutevar" instructions on x86, hence the confusion.

Note that permutevar is weird in its own way because it uses the second bit of each element to perform the selection; that is, i64x2.permute_dyn(mask: i64x2) with mask = [1, 0], implemented on top of permutevarpd, requires a lane-wise mask << 1 to map [1, 0] to [0b10, 0b00] before calling the instruction.
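
A rough Python model of that quirk (illustrative only; permutevar_pd here just mimics the selection rule of the variable-control instruction):

def permutevar_pd(a, control):
    # Bit 1 (not bit 0) of each 64-bit control lane selects the source element.
    return [a[(control[i] >> 1) & 1] for i in range(2)]

a = [10.0, 20.0]
mask = [1, 0]  # the selectors an i64x2.permute_dyn caller would pass
print(permutevar_pd(a, [m << 1 for m in mask]))  # [20.0, 10.0]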

@penzn
Contributor Author

penzn commented Mar 8, 2019

Out of range indices have been brought up before, but did not get a whole lot of attention: #11

@zeux
Contributor

zeux commented Mar 8, 2019

@penzn Right now there are no instructions that take variable shuffle masks. This issue proposes one, and some behavior needs to be standardized for out-of-range indices. The discussion is about which behavior we need, I think; it's not clear that "modulo 16" is the optimal balance.

@penzn
Contributor Author

penzn commented Mar 8, 2019

I understand that; there seems to be consensus in favor of dynamic shuffle. If supporting out-of-range indices is the route we take, it may make sense to have both static and dynamic shuffles support them.

@tlively
Member

tlively commented Mar 8, 2019

If supporting out-of-range indices is the route we take, it may make sense to have both static and dynamic shuffles support them.

The compiler can easily ensure that all indices are in range in the static case, so allowing out of range indices for static shuffles gets us extra complexity with no additional utility. In contrast, the tools cannot reason about dynamic shuffles, so there is no way to statically prevent out of bounds indices for dynamic shuffles and they must be handled somehow.

@binji
Member

binji commented Mar 8, 2019

If unspecified behavior is unacceptable then the behavior as specced seems best, since option 2 is zero-cost for ARM but more expensive for x86_64, whereas option 1 is more balanced.

Completely unspecified behavior is probably unacceptable. Non-deterministic behavior is perhaps OK. For example, we already have non-determinism w.r.t. NaN behavior. That said, we've already decided that it is OK to sacrifice some speed for determinism, e.g. requiring masks on scalar shifts.

@zeux
Contributor

zeux commented Mar 8, 2019

Right - with the comment from @Maratyszcza it seems to me that the best option is to specify that out-of-range indexing returns 0.

It's the same as the currently proposed option with respect to performance on x64 (and presumably MIPS), faster (zero-cost) on ARM and RISCV, and slower (non-zero-cost) on PPC64.

This seems like a better tradeoff since it also unlocks savings in the calling code. For example, on x64, in the benchmark I posted earlier on a different thread, I'd have to work around the lack of out-of-bounds handling by spending one more instruction; if the wasm behavior guarantees this, I won't need to do that, so the resulting code is faster.

Additionally, this is also cleaner and easier to explain compared to options that try to special-case the current x64 behavior with respect to the high bit.

@dtig closed this as completed in #71 on Mar 27, 2019
dtig pushed a commit that referenced this issue Mar 27, 2019
This change adds a variable shuffle instruction to the SIMD proposal.

When an index is out of range, the result for that lane is specified as 0.
This matches hardware behavior on the ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or either of the two high bits is
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, a saturating add (saturate(x + (128 - 16)) for
x86_64) and a permute, to emulate the proposed behavior.

To distinguish variable shuffles from immediate shuffles, the existing
v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be
explicit about the fact that it shuffles two vectors with an immediate
argument.

This naming scheme allows for adding variants like v8x16.shuffle2 and
v8x16.shuffle1_imm in the future.

Fixes #68.
Contributes to #24.
Fixes #11.