Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISC-V] codegen seems not ideal #361

Closed
japaric opened this issue Jun 19, 2019 · 10 comments
Closed

[RISC-V] codegen seems not ideal #361

japaric opened this issue Jun 19, 2019 · 10 comments
Labels

Comments

@japaric
Copy link
Member

japaric commented Jun 19, 2019

Disclaimer: I don't know much of the RISC-V ISA so this may be just my poor
understanding.

Some things I observed while working with the hifive1 board.

NOTE: wherever the (compilation) target is not specified,
riscv32imac-unknown-none-elf was used

jalr instead of jal

All direct function calls seems to use some sort of relocatable form. Example
below:

#[no_mangle]
unsafe fn _start() -> ! {
    foo();

    loop {}
}

#[inline(never)]
#[no_mangle]
unsafe fn foo() {
    asm!("" : : : "memory" : "volatile")
}
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       addi    sp, sp, -16
   11002:       sw      ra, 12(sp)
   11004:       sw      s0, 8(sp)
   11006:       addi    s0, sp, 16
   11008:       auipc   ra, 0
   1100c:       jalr    ra, ra, 10
   11010:       j       0

0000000000011012 foo:
   11012:       addi    sp, sp, -16
   11014:       sw      ra, 12(sp)
   11016:       sw      s0, 8(sp)
   11018:       addi    s0, sp, 16
   1101a:       lw      s0, 8(sp)
   1101c:       lw      ra, 12(sp)
   1101e:       addi    sp, sp, 16
   11020:       ret

That auipc, jalr combination looks like relocatable code to me (not that I
know much about relocatable code). I was expecting to see jal function calls
of this form:

global_asm!(r#"
  .global _start
_start:
  addi    sp, sp, -16
  sw      ra, 12(sp)
  sw      s0, 8(sp)
  addi    s0, sp, 16
  jal     foo
  j       0
"#);

#[inline(never)]
#[no_mangle]
unsafe fn foo() {
    asm!("" : : : "memory" : "volatile")
}
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       addi    sp, sp, -16
   11002:       sw      ra, 12(sp)
   11004:       sw      s0, 8(sp)
   11006:       addi    s0, sp, 16
   11008:       jal     6
   1100c:       j       0

000000000001100e foo:
   1100e:       addi    sp, sp, -16
   11010:       sw      ra, 12(sp)
   11012:       sw      s0, 8(sp)
   11014:       addi    s0, sp, 16
   11016:       lw      s0, 8(sp)
   11018:       lw      ra, 12(sp)
   1101a:       addi    sp, sp, 16
   1101c:       ret

Since the target specification says relocation-model: "static":

$ rustc -Z unstable-options --print target-spec-json --target riscv32imac-unknown-none-elf
(..)
  "relocation-model": "static",
(..)

That would save 4 bytes of .text per function call.

atomic::compiler_fence produces an instruction

This program shows that atomic::compiler_fence produces a fence instruction

#[no_mangle]
unsafe fn _start() -> ! {
    atomic::compiler_fence(Ordering::SeqCst);

    loop {}
}
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       addi    sp, sp, -16
   11002:       sw      ra, 12(sp)
   11004:       sw      s0, 8(sp)
   11006:       addi    s0, sp, 16
   11008:       fence   rw, rw
   1100c:       j       0

Even though the description of the function states that "compiler_fence does
not emit any machine code".

As it is, both atomic::fence and atomic::compiler_fence generate the same
machine code.

#[no_mangle]
unsafe fn _start() -> ! {
    atomic::fence(Ordering::SeqCst);

    loop {}
}
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       addi    sp, sp, -16
   11002:       sw      ra, 12(sp)
   11004:       sw      s0, 8(sp)
   11006:       addi    s0, sp, 16
   11008:       fence   rw, rw
   1100c:       j       0

You don't see this behavior with the ARM Cortex-M backend:

#[no_mangle]
unsafe fn _start() -> ! {
    atomic::compiler_fence(Ordering::SeqCst);

    loop {}
}
$ cargo objdump --target thumbv7m-none-eabi --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       b       #-4 <_start>

intrinsics::abort != UNIMP

Calling intrinsics::abort produces a function call to the abort symbol even
though the UNIMP instruction exists.

#[no_mangle]
unsafe fn _start() -> ! {
    core::intrinsics::abort()
}

global_asm!(r#"
  .global abort
abort:
  UNIMP
"#);
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 abort:
   11000:       unimp

0000000000011002 _start:
   11002:       addi    sp, sp, -16
   11004:       sw      ra, 12(sp)
   11006:       sw      s0, 8(sp)
   11008:       addi    s0, sp, 16
   1100a:       auipc   ra, 0
   1100e:       jalr    ra, ra, -10
   11012:       auipc   ra, 0
   11016:       jalr    ra, ra, -18

I was actually expecting something like this to be generated:

global_asm!(r#"
  .global _start
_start:
  UNIMP
"#);
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       unimp

noreturn nounwind & divergent functions

The riscv32imac-unknown-none-elf target uses panic-strategy: "abort"; that
tells the backend that functions never unwind so divergent functions fn() -> !
, which never return and are marked as noreturn nounwind in LLVM IR, should
not preserve the caller "saved registers". However, one observes register
stacking in the following program:

#[no_mangle]
unsafe fn _start() -> ! {
    // (RISC-V has so many registers)
    asm!("" : :
         "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7) "r"(8) "r"(9)
         "r"(10) "r"(11) "r"(12) "r"(13) "r"(14) "r"(15) "r"(16)
         : : "volatile");

    loop {}
}
$ cargo objdump --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       addi    sp, sp, -16
   11002:       sw      ra, 12(sp)
   11004:       sw      s0, 8(sp)
   11006:       sw      s1, 4(sp)
   11008:       addi    s0, sp, 16
   1100a:       addi    a6, zero, 1
   1100c:       addi    a7, zero, 2
   1100e:       addi    t0, zero, 3
   11010:       addi    t1, zero, 4
   11012:       addi    t2, zero, 5
   11014:       addi    t3, zero, 6
   11016:       addi    t4, zero, 7
   11018:       addi    t5, zero, 8
   1101a:       addi    t6, zero, 9
   1101c:       addi    a3, zero, 10
   1101e:       addi    a4, zero, 11
   11020:       addi    a5, zero, 12
   11022:       addi    a0, zero, 13
   11024:       addi    a1, zero, 14
   11026:       addi    a2, zero, 15
   11028:       addi    s1, zero, 16
   1102a:       j       0

s0 and s1 are pushed onto the stack but never popped. Is that required by
the ISA / C ABI?

(off-topic: why is ra also being pushed onto the stack when _start performs
no function call?)

Compare the previous program to this ARM Cortex-M program:

#[no_mangle]
unsafe fn _start() -> ! {
    asm!("" : :
         "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7) "r"(8) "r"(9)
         "r"(10) "r"(11) "r"(12) "r"(13)
         : : "volatile");

    loop {}
}
$ cargo objdump --target thumbv7m-none-eabi --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       mov.w   r12, #0
   11004:       mov.w   lr, #1
   11008:       mov.w   r8, #2
   1100c:       mov.w   r9, #3
   11010:       mov.w   r10, #4
   11014:       mov.w   r11, #5
   11018:       movs    r4, #6
   1101a:       movs    r5, #7
   1101c:       movs    r6, #8
   1101e:       movs    r7, #9
   11020:       movs    r2, #10
   11022:       movs    r3, #11
   11024:       movs    r0, #12
   11026:       movs    r1, #13
   11028:       b       #-4 <_start+0x28>

Registers are never pushed onto the stack.

Compare that machine code to the machine code generated for a non-divergent
function:

#[no_mangle]
unsafe fn _start() {
    asm!("" : :
         "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7) "r"(8) "r"(9)
         "r"(10) "r"(11) "r"(12) "r"(13)
         : : "volatile");
}
$ cargo objdump --target thumbv7m-none-eabi --bin app --release -- -C -d -no-show-raw-insn
0000000000011000 _start:
   11000:       push.w  {r4, r5, r6, r7, r8, r9, r10, r11, lr}
   11004:       mov.w   r12, #0
   11008:       mov.w   lr, #1
   1100c:       mov.w   r8, #2
   11010:       mov.w   r9, #3
   11014:       mov.w   r10, #4
   11018:       mov.w   r11, #5
   1101c:       movs    r4, #6
   1101e:       movs    r5, #7
   11020:       movs    r6, #8
   11022:       movs    r7, #9
   11024:       movs    r2, #10
   11026:       movs    r3, #11
   11028:       movs    r0, #12
   1102a:       movs    r1, #13
   1102c:       pop.w   {r4, r5, r6, r7, r8, r9, r10, r11, pc}

Registers are pushed in the prologue and then popped in the epilogue of the
function.


It would be good to check if these backend issues have been fixed in the latest
version of LLVM (using llc) because rustc is using a several months old
LLVM. If a recent commit shows the same issues then we may want to submit bug
reports to the LLVM project.

Metadata

$ rustc -Vv
rustc 1.37.0-nightly (04a3dd8a8 2019-06-18)
binary: rustc
commit-hash: 04a3dd8a872633ca1e4c217d11f741cc35cb19a5
commit-date: 2019-06-18
host: x86_64-unknown-linux-gnu
release: 1.37.0-nightly
LLVM version: 8.0

cc @rust-embedded/riscv @Disasm

@Disasm
Copy link
Member

Disasm commented Jun 20, 2019

Oh, someday I'll have to become an LLVM developer. I'm 99% sure these strange things have nothing to do with Rust itself, but I need to check anyway.
Thanks for the observations!

@Disasm
Copy link
Member

Disasm commented Jun 24, 2019

atomic::compiler_fence produces an instruction

Even though the description of the function states that "compiler_fence does
not emit any machine code".

The description is wrong. This compiler_fence lowers to atomic_fence(ordering, SingleThread) and this atomic_fence can be lowered to any architecture-dependent code, not only to empty code.
LLVM lowers atomic_fence(ordering, SingleThread) to CompilerBarrier for AArch64 and ARM targets, but not for AVR, PowerPC, RISC-V and Spark. Of cause, this "SingleThread means CompilerBarrier" hack could be added for RISC-V too, but I do not see any adequate reason for this. Something is wrong on the Rust side.

intrinsics::abort != UNIMP

Are there any guidelines on abort implementation? I think that UNIMP instruction can cause "Illegal instruction" exception and further stack overrun. Infinite loop inside abort seems like a better idea to me.

noreturn nounwind & divergent functions

The same for clang: https://godbolt.org/z/YPa98Z
However, this register save code elimination is quite complex thing because you need to proof that it's useless. At the same time we do not see register restores because this code was removed during the dead code elimination process, which is easier. Also you will get the same "optimizations" with gcc.

jalr instead of jal

I do not have any answer yet, trying to investigate...

@Disasm
Copy link
Member

Disasm commented Jun 30, 2019

jalr instead of jal

Compiler generates call <something> instruction, which translates into auipc+jalr pair (The RISC-V Instruction Set Manual Volume I: User-Level ISA v2.2, p.110, table 20.2). That what you can see in *.o files because at this point it's impossible to determine whether you can call your function with jal. However, linker could translate this to jal (or even j) instruction if it's instructed to do so with linker relaxation attribute. By default, LLVM do not use this attribute for some reason, but it's possible to turn it on.

Unfortunately in Rust this doesn't work: the attribute starts to appear in object files, but rust-lld ignores it and leaves instructions as is. I need to figure out why.

@lenary
Copy link

lenary commented Aug 1, 2019

Hi! I'm a developer at lowRISC, working with Alex Bradbury on the RISC-V llvm backend. I've just started (yesterday!) looking at where we can improve the RISC-V Rust backend, in addition to my main LLVM work.

The LLVM backend only recently became non-experimental (around the time of the 9.0 branch, with releases imminent). With that, the aim is that we can compile most RISC-V programs in rv{32,64}imacfd, supporting the main ABIs. We know there are still lots of opportunities to produce "better" code. I'm happy to note that none of these issues make your program incorrect, they just make it use a slightly

Looking at these issues:

  • jalr instead of jal. You are correct that this is an optimisation commonly performed by the linker, during linker relaxation. I'm not sure LLD does linker relaxation yet, nor am I sure there are any outstanding patches so it may do so. GNU ld will definitely perform this optimisation (the correct relocations should be being emitted).
  • intrinsics::abort I think we haven't really addressed many custom/intrinsic lowerings in the RISC-V llvm backend yet. I shall take a closer look at this.
  • atomic::compiler_fence I will take a look at how this is lowered. It may be a case of doing a different lowering in the RISC-V backend, but somehow this feels like a higher-level issue than that. I'm going to have to look at some more resources, and see if this should instead be a change somewhere in the target-independent part of LLVM.
  • Stack usage in noreturn, nounwind functions: Right now it seems like the compiler is naive, and saves registers it doesn't need to. We can look at this not happening (subject to the ABI), but it's not a top priority.

I hope this clears up what's going on!

@SUPERCILEX
Copy link

For anyone working on branch prediction, this was just recently fixed: rust-lang/rust#109426. Enable like this: SUPERCILEX/hardcaml_riscv@d1125a6#diff-f34d15ec63ba1edb3a7a25b6a3e3b7d3c1f02cbef0b66a6a73e8cf5c959a8576

@rmsyn
Copy link
Contributor

rmsyn commented Aug 17, 2024

jalr instead of jal

This has been resolved since this issue was opened: https://godbolt.org/z/zqET14TEs

noreturn nounwind and divergent functions

This has been resolved since this issue was opened: https://godbolt.org/z/5bsEss6M3

intrinsics::abort != UNIMP

This has been resolved since this issue was opened: https://godbolt.org/z/qv7GndhEc

atomic::compiler_fence produces an instruction

This issue seems to still exist in the current stable and nightly compilers: https://godbolt.org/z/ao3464vbe


Out of the original issues noted, the only remaining issue seems to be atomic::compiler_fence and atomic::fence emitting code.

Not sure if these should actually lower into a fence and/or fence.i RISC-V instruction, or what the desired behavior is.

@jamesmunns
Copy link
Member

The compiler fence seems to disappear when -C opt-level="..." is passed. It's likely the compiler fence is making a function call but doing nothing, but once optimized the actual call is inlined and the side effect is removed.

@rmsyn
Copy link
Contributor

rmsyn commented Aug 18, 2024

The compiler fence seems to disappear when -C opt-level="..." is passed.

Confirmed that this works, atomic::compiler_fence disappears, where atomic::fence emits a fence instruction.

So, I guess this would only cause an issue for users trying to run debug builds, expecting either of those calls to do something?

This could be an annoying "gotcha" for the atomic::fence call.

@BartMassey
Copy link
Member

Everything sounds like it's working more-or-less as expected now. Anyone object to closing this issue?

@adamgreig
Copy link
Member

Thanks for triaging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants