Implement Wasm Atomics for Cranelift/newBE/aarch64. #2077
Conversation
🎉
/// Xor
Xor,
/// Exchange
Xchg,
cg_clif also requires nand, max, min, umax, umin.
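For reference, a sketch of what those additional variants could look like, in the style of the excerpt above (the spellings follow the `Nand`/`Smin`/`Smax`/`Umin`/`Umax` names that Cranelift's RMW-op enum eventually used; treat them as illustrative here, not as part of this patch):

```rust
/// Nand
Nand,
/// Signed minimum
Smin,
/// Signed maximum
Smax,
/// Unsigned minimum
Umin,
/// Unsigned maximum
Umax,
```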
On further reflection, I am inclined to rename two of the new AArch64
instructions. At the AArch64 level, the corresponding new instructions are
`::LLSC` and `::CAS`. I am thinking these should actually be `::AtomicRMW`
and `::AtomicCAS`, implying, via the name, that the first of these
(`::AtomicRMW`) is defined by the operation it performs rather than by its
load-linked/store-conditional implementation. Reviewers: any objections to
the proposed renaming?
+1 to the proposed renaming.
This is really great -- I'm happy we'll finally have atomics support!
My main comments below have to do with the emission code, and the desire to make a bit more use of the helpers and other infrastructure rather than hardcoding hex constants; and some thoughts on the RMW ops. The overall design is good though!
(force-pushed from bd6e2c5 to 1b7c1b2)
I skimmed this, and confirm the meta-code is correct, namely that the configuration of type variables in the new instructions does what the comments say (thanks for the extensive comments, btw!). Please make sure to fix x64 compilation before landing.
(force-pushed from 115490f to db3a0ac)
LGTM -- just one tiny nit for clarity below, otherwise happy to see this land. Thanks!
(force-pushed from db3a0ac to 23a3953)
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups:

* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences

and the implementation mirrors that structure, at both the CLIF and AArch64
levels.
At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
* for those that take addresses (all except fences), the address is contained
  entirely in a single `Value`; there is no offset field as there is with
  normal loads and stores. Wasm atomics require alignment checks, and
  removing the offset makes implementation of those checks a bit simpler.
* atomic loads and stores get their own instructions, rather than reusing the
  existing load and store instructions, for two reasons:
  - per the above comment, it makes alignment checking simpler
  - reuse of existing loads and stores would require extension of `MemFlags`
    to indicate atomicity, which sounds semantically unclean. For example,
    then *any* instruction carrying `MemFlags` could be marked as atomic,
    even in cases where it is meaningless or ambiguous.
* I tried to specify, in comments, the behaviour of these instructions as
  tightly as I could. Unfortunately there is no way (per my limited CLIF
  knowledge) to enforce the constraint that they may only be used on I8, I16,
  I32 and I64 types, and in particular not on floating point or vector types.
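To make the five-instruction surface concrete, here is a sketch of how they might be driven from Rust through `cranelift-frontend`'s `InstBuilder`. The method names mirror the five groups, but the exact signatures (notably the `MemFlags` and explicit type arguments) are assumptions based on the Cranelift API as it later stabilised, not a quote from this patch:

```rust
use cranelift_codegen::ir::{self, types, AtomicRmwOp, MemFlags};
use cranelift_frontend::FunctionBuilder;

// Build one instruction from each of the five groups. `addr` is the
// complete effective address (there is no offset field, per the first
// comment above) and `x` is an i32 operand.
fn build_atomics(b: &mut FunctionBuilder, addr: ir::Value, x: ir::Value) -> ir::Value {
    let flags = MemFlags::new();
    // 1. atomic read-modify-write: *addr += x, yielding the old value
    let old = b.ins().atomic_rmw(types::I32, flags, AtomicRmwOp::Add, addr, x);
    // 2. atomic compare-and-swap: if *addr == old { *addr = x }, yielding
    //    the value observed at *addr
    let _found = b.ins().atomic_cas(flags, addr, old, x);
    // 3. atomic load
    let v = b.ins().atomic_load(types::I32, flags, addr);
    // 4. atomic store
    b.ins().atomic_store(flags, v, addr);
    // 5. fence
    b.ins().fence();
    v
}
```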
The translation from Wasm to CLIF, in `code_translator.rs`, is unremarkable.

At the AArch64 level, there are also five new instructions, one for each
group. All of them except `::Fence` contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a
store followed by a fence, respectively. The amount of fencing may be
overkill, but it reflects exactly what the SpiderMonkey Wasm baseline
compiler for AArch64 does.
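That fence placement can be illustrated with freestanding Rust inline asm (this is an illustration of the described scheme, not the PR's emission code, and the choice of `dmb ish` as the barrier is my assumption):

```rust
// Atomic 32-bit load and store with the fence placement described above:
// the load is preceded by a fence, the store is followed by one.
#[cfg(target_arch = "aarch64")]
unsafe fn atomic_load_u32(p: *const u32) -> u32 {
    let v: u32;
    core::arch::asm!(
        "dmb ish",          // fence before the load
        "ldr {v:w}, [{p}]", // the load itself
        p = in(reg) p,
        v = out(reg) v,
    );
    v
}

#[cfg(target_arch = "aarch64")]
unsafe fn atomic_store_u32(p: *mut u32, x: u32) {
    core::arch::asm!(
        "str {x:w}, [{p}]", // the store itself
        "dmb ish",          // fence after the store
        p = in(reg) p,
        x = in(reg) x,
    );
}
```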
One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.
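The shape of that single indivisible unit, for an atomic 32-bit add, looks roughly like the following (again written as freestanding Rust inline asm purely for illustration; the real expansion is emitted directly by the backend and uses the fixed registers x24-x28, where this version lets the compiler choose):

```rust
// LL/SC loop guarded at both ends by fences, per the description above.
// Nothing between ldxr and stxr touches memory: an extra memory
// transaction there could cause the store-conditional to keep failing on
// some hardware, which is why the loop must reach the register allocator
// as one indivisible instruction.
#[cfg(target_arch = "aarch64")]
unsafe fn atomic_add_llsc(p: *mut u32, delta: u32) -> u32 {
    let old: u32;
    core::arch::asm!(
        "dmb ish",                     // leading fence
        "2:",
        "ldxr {old:w}, [{p}]",         // load-linked
        "add {tmp:w}, {old:w}, {d:w}", // the r-m-w operation (here: add)
        "stxr {st:w}, {tmp:w}, [{p}]", // store-conditional; 0 on success
        "cbnz {st:w}, 2b",             // retry if it failed
        "dmb ish",                     // trailing fence
        p = in(reg) p,
        d = in(reg) delta,
        old = out(reg) old,
        tmp = out(reg) _,
        st = out(reg) _,
    );
    old
}
```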
The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code is that they both need a scratch register internally. Rather
than faking one up by claiming, in `get_regs`, that they modify an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen
so as to be call-clobbered, hence their use is less likely to interfere with
long live ranges that span calls.
One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copies from/to the arg/result
registers are done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.
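A minimal sketch of the idea behind `ensure_in_vreg`, assuming regalloc.rs-style `Reg`/`Writable` types and `LowerCtx`'s `alloc_tmp` and `emit` helpers (the method's actual signature and body are in the patch and may differ):

```rust
// If `reg` is already a virtual register, use it as-is. If it is a real
// register, copy it into a fresh vreg first, so that the later moves into
// the fixed x24-x28 registers can be emitted in any order without
// clobbering an argument that is still needed. Coalescing makes the extra
// copy free in the common case.
fn ensure_in_vreg(ctx: &mut dyn LowerCtx<I = Inst>, reg: Reg, ty: Type) -> Reg {
    if reg.is_virtual() {
        reg
    } else {
        let tmp = ctx.alloc_tmp(reg.get_class(), ty);
        ctx.emit(Inst::gen_move(tmp, reg, ty));
        tmp.to_reg()
    }
}
```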
There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap
insns in a row were generated.
In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.