
[RFC] Dynamic shuffle #68

Closed
penzn opened this issue Mar 7, 2019 · 13 comments
Comments

@penzn
Contributor

penzn commented Mar 7, 2019

Background in #30 by @lemaitre, also part of #24.

I would like to narrow this issue down to just the i8x16.permute_dyn instruction proposed in #30, which would shuffle the contents of one vector using the lanes of a second vector as indices.

@zeux and @gnzlbg point out that a dynamic permute is needed for bytewise processing on structures of dynamic width. Major hardware vendors support some form of dynamic shuffle.

From proposed changes in #30:

  • v8x16.permute_dyn(a: v128, s: v128) -> v128

Returns a new vector with lanes selected from the lanes of the first input vector
a, using the lanes of the second input vector s as indices. The indices from s
are first brought into the range [0, 15] via a modulo.

def S.permute_dyn(a, s):
    result = S.New()
    for i in range(S.Lanes):
      result[i] = a[s[i] % S.Lanes]
    return result
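
For concreteness, a small worked example of the modulo semantics above; this is plain Python written for illustration and is not part of the proposal:

def permute_dyn_mod16(a, s):
    # Current proposal: indices are reduced modulo the lane count (16).
    return [a[s[i] % 16] for i in range(16)]

a = list(range(100, 116))               # lane i holds the value 100 + i
s = [0, 1, 15, 16, 17, 255] + [0] * 10  # after modulo: 16 -> 0, 17 -> 1, 255 -> 15
print(permute_dyn_mod16(a, s)[:6])      # [100, 101, 115, 100, 101, 115]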
@gnzlbg
Contributor

gnzlbg commented Mar 7, 2019

To the best of my knowledge, the following hardware natively supports this operation as a single instruction (listed as arch: ISA (instruction)):

  • x86_64: SSSE3 (pshufb), AVX2 (vpshufb), AVX-512 (vpshufb)
  • arm: NEON ({v}tb{l,x})
  • ppc64: altivec (vperm), vsx (xxperm)
  • mips: msa (vshf)
  • riscv: V (vrgather)

@tlively
Member

tlively commented Mar 7, 2019

From the previous conversation it seems to me that this would be easy to implement both in engines and in tooling, and that it can lead to clear performance wins for some applications. I personally see no problem with including this in the MVP.

@zeux
Contributor

zeux commented Mar 8, 2019

Just in case, I want to address the possible options for handling out-of-range indices.

Here's how each architecture handles out-of-range indices in the selector (a rough model of the x86_64 and ARM rules follows the list):

  • x86_64: if bit 7 is 1, set the lane to 0; otherwise use the lower 4 bits for the lookup
  • ARM: if the index is out of range (0-15), set the lane to 0
  • PPC64: use the lower 4 bits for the lookup (no out-of-range handling)
  • MIPS: if bit 6 or bit 7 is 1, set the lane to 0; otherwise use the lower 4 bits for the lookup (or rather, the lower 5 bits index a 32-element table constructed from the 2 input vectors, but if both vectors are the same this effectively means bits 4 and 5 are ignored)
  • RISCV: if the index is out of range (0-15), set the lane to 0
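
To make the two extremes concrete, here is a rough Python model of the x86_64 and ARM selector rules described above (illustrative only; selector bytes are assumed to be unsigned values 0-255):

def pshufb(a, s):
    # x86_64 SSSE3 rule: if bit 7 of the selector byte is set, the lane becomes 0;
    # otherwise only the low 4 bits are used for the lookup.
    return [0 if s[i] & 0x80 else a[s[i] & 0x0F] for i in range(16)]

def neon_tbl(a, s):
    # ARM NEON rule: any selector outside 0..15 yields 0.
    return [a[s[i]] if s[i] < 16 else 0 for i in range(16)]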

The currently proposed specification requires all platforms except PPC64 to mask off the upper 4 bits with an and instruction, resulting in two instructions (if an and with an immediate/memory argument is available).

Alternatively, if the specification says that all out-of-range indices return 0, then ARM and RISCV will use the native instruction, x86_64/PPC64/MIPS will have to generate additional checks using a 3-instruction sequence like "compare with 15; shuffle; andnot".

Finally - and I'm not sure if this is OK to do in WebAssembly, but that's what C would do ;) - we could say that if the high bit (0x80) of the index is set we return 0, if the index is 0-15 we return the indexed lane, and otherwise the resulting byte is unspecified. This seems like the cleanest way to unify x86_64 behavior with the rest, resulting in a zero-cost implementation for x86_64, ARM, MIPS, and RISCV, and a 3-instruction sequence (compare with 0x80, shuffle, andnot) on PPC64.

I would be happy having the instruction with any of these behaviors - it's not free to work around out-of-range behavior in some cases, and with the proposed spec we'd be paying the cost of an extra and (1 cycle, assuming the load is free?) for each shuffle, but it's still substantially better than the status quo.

However, I want to make sure we carefully consider all possibilities. If unspecified behavior is acceptable, the third option is the cleanest: it saves 1 cycle in almost all cases and gives a cheap way to set elements to 0, which in some cases saves an extra instruction or two in the code that uses the shuffles. The cost is unspecified behavior for some index values, plus a penalty on PPC64, which doesn't seem more valuable than the other architectures.

If unspecified behavior is unacceptable then the behavior as specced seems best, since option 2 is zero-cost for ARM but more expensive for x86_64, whereas option 1 is more balanced.

@Maratyszcza
Contributor

Maratyszcza commented Mar 8, 2019

@zeux Actually, if the specification says that all out-of-range indices return 0, then x86_64 can implement it with just two instructions: an unsigned saturated add (PADDUSB) of byte 112 (= 128 - 16) followed by PSHUFB. Similarly for MIPS MSA, albeit with a different constant.
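
A quick way to see why this works, modeled in the same Python style as the spec pseudocode (illustrative, not normative): adding 112 with unsigned saturation keeps indices 0-15 below 128 with their low nibble intact, while any index of 16 or more ends up at 128 or above, which PSHUFB then zeroes.

def satadd_u8(x):
    return min(x + 112, 255)  # PADDUSB with the constant 112 = 128 - 16

def pshufb(a, s):
    return [0 if s[i] & 0x80 else a[s[i] & 0x0F] for i in range(16)]

def permute_dyn_zero(a, s):
    # "Out-of-range indices return 0" semantics via saturating add + PSHUFB.
    return pshufb(a, [satadd_u8(s[i]) for i in range(16)])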

The "out-of-range indices return 0" schema has an important advantage that it permits emulating larger than 16-byte lookup table. E.g. a 32 byte lookup can be simulated as permute_dyn(low16, idx) + permute_dyn(high16, idx - 16)

@zeux
Contributor

zeux commented Mar 8, 2019

Good point! Then it seems like the best option is to specify that out-of-range indices should return 0. This is zero cost on ARM and RISCV, and same cost as the currently proposed option on x86_64 and MIPS. PPC64 is the only platform where the currently proposed option is zero-cost but I don't think we should be optimizing for PPC64 over ARM...

@lemaitre

lemaitre commented Mar 8, 2019

When I wrote the specification of v8x16.permute_dyn in my PR, I knew about the ARM out-of-range policy, but I thought pshufb on x86 did not look at the upper bits (like what I proposed).
In fact, that is the case for all the other "permutevar" instructions on x86, hence the confusion.

I'm happy to change the specification to something more useful and better supported.

So I think the best behavior is the following:

def S.permute_dyn(a, s):
    result = S.New()
    for i in range(S.Lanes):
        idx = s[i]
        if idx & (1 << 7) != 0:
            result[i] = 0
        elif idx < S.Lanes:
            result[i] = a[idx]
        else:
            result[i] = unspecified_value()
    return result

Therefore, we get a perfect translation on all major platforms except PPC (where a couple of extra instructions would be required).

If we cannot afford to specify unspecified_value() (or undefined_behavior()), then we cannot have a perfect translation on both ARM and x86.
I tend to prefer unspecified_value() over undefined_behavior() as it is more constrained (e.g. it cannot format your filesystem) while still allowing perfect codegen whenever possible.

@gnzlbg
Contributor

gnzlbg commented Mar 8, 2019

I thought pshufb on x86 did not look at the upper bits (like what I proposed).

You are not alone, I also forgot about this detail. Thank you @zeux for bringing it to our attention.

In fact, that is the case for all the other "permutevar" instructions on x86, hence the confusion.

Note that permutevar is weird in its own way because it uses the second bit of each element to perform the selection; that is, i64x2.permute_dyn(mask: i64x2) with mask = [1, 0], implemented on top of permutevarpd, requires a lane-wise mask << 1 to map [1, 0] to [0b10, 0b00] before calling the instruction.
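
A rough Python model of that quirk (illustrative only; permutevar_pd here just mimics the selection rule of the variable-control instruction):

def permutevar_pd(a, control):
    # Bit 1 (not bit 0) of each 64-bit control lane selects the source element.
    return [a[(control[i] >> 1) & 1] for i in range(2)]

a = [10.0, 20.0]
mask = [1, 0]  # the selectors an i64x2.permute_dyn caller would pass
print(permutevar_pd(a, [m << 1 for m in mask]))  # [20.0, 10.0]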

@penzn
Contributor Author

penzn commented Mar 8, 2019

Out of range indices have been brought up before, but did not get a whole lot of attention: #11

@zeux
Contributor

zeux commented Mar 8, 2019

@penzn Right now there are no instructions that take variable shuffle masks. This issue proposes one, and some behavior needs to be standardized for out-of-range indices. The discussion is about which behavior we need, I think; it's not clear that "modulo 16" is the optimal balance.

@penzn
Contributor Author

penzn commented Mar 8, 2019

I understand that; there seems to be consensus in favor of dynamic shuffle. If supporting out-of-range indices is the route we take, it may make sense to have both static and dynamic shuffles support them.

@tlively
Member

tlively commented Mar 8, 2019

If supporting out-of-range indices is the route we take, it may make sense to have both static and dynamic shuffles support them.

The compiler can easily ensure that all indices are in range in the static case, so allowing out of range indices for static shuffles gets us extra complexity with no additional utility. In contrast, the tools cannot reason about dynamic shuffles, so there is no way to statically prevent out of bounds indices for dynamic shuffles and they must be handled somehow.

@binji
Member

binji commented Mar 8, 2019

If unspecified behavior is unacceptable then the behavior as specced seems best, since option 2 is zero-cost for ARM but more expensive for x86_64, whereas option 1 is more balanced.

Completely unspecified behavior is probably unacceptable. Non-deterministic behavior is perhaps OK. For example, we already have non-determinism w.r.t. NaN behavior. That said, we've already decided that it is OK to sacrifice some speed for determinism, e.g. requiring masks on scalar shifts.

@zeux
Contributor

zeux commented Mar 8, 2019

Right - with the comment from @Maratyszcza it seems to me that the best option is to specify that out-of-range indexing returns 0.

It's the same as the currently proposed option with respect to performance on x64 (and presumably MIPS), faster (zero-cost) on ARM and RISCV, and slower (non-zero-cost) on PPC64.

This seems like a better tradeoff since it also unlocks savings in the calling code. For example, on x64, in the benchmark I posted earlier on a different thread, I'd have to work around the lack of out-of-bounds handling by spending one more instruction; if the wasm behavior guarantees this, I won't need to do that, so the resulting code is faster.

Additionally, this is also cleaner and easier to explain compared to options that try to special-case the current x64 behavior with respect to the high bit.

@dtig closed this as completed in #71 on Mar 27, 2019
dtig pushed a commit that referenced this issue Mar 27, 2019
This change adds a variable shuffle instruction to the SIMD proposal.

When an index is out of range, the result for that lane is specified as 0.
This matches hardware behavior on the ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or either of the two high bits is
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, a saturating add (saturate(x + (128 - 16)) for
x86_64) and a permute, to emulate the proposed behavior.

To distinguish variable shuffles from immediate shuffles, the existing
v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be
explicit about the fact that it shuffles two vectors with an immediate
argument.

This naming scheme allows for adding variants like v8x16.shuffle2 and
v8x16.shuffle1_imm in the future.

Fixes #68.
Contributes to #24.
Fixes #11.