Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the auipcc representability challenge #56

Closed
sorear opened this issue Jan 26, 2024 · 14 comments
Closed

the auipcc representability challenge #56

sorear opened this issue Jan 26, 2024 · 14 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation For next release

Comments

@sorear
Copy link
Contributor

sorear commented Jan 26, 2024

Call a segment a sequence of one of more functions and data objects packed together and accessible using a single value of pcc. Data references and function pointer (sentry) construction from the functions in the segment to the other contents of the segment will use sequences of AUIPCC and CINCOFFSETIMM or load/store instructions, mimicking the use of auipc and addi or load/store sequences in the base ISA.

AUIPCC constructs a capability whose address is within [-2047,2048] of the symbol address. Assuming that the symbol address is strictly within or at one end of the segment, the intermediate capability constructed by AUIPCC will have an address between 2047 before the segment base and 2048 after the segment end.

Under RV64, this is always representable, because all capabilities have a representable range that covers at least 2048 before and 8193 after their bounds. Under RV32, the representability depends on the segment size. If the segment is strictly less than 2048 bytes, then the only AUIPCC instructions which appear will have an immediate of 0, which returns the pcc unchanged and is always representable. If the segment is 4096 bytes or larger, it will have a representable range covering 2048 before and after and any intermediate value for any address inside the segment will be representable.

The problematic case for RV32 is segments of [2048,4095] bytes. Within that size range, addresses at least 2049 bytes after the end are guaranteed representable, but only 1024 before the base. If an AUIPCC near the end of the segment is used for a symbol near the beginning, the AUIPCC value can fall off the representable region, resulting in an untagged capability.

Plausible solutions:

  • Change AUIPCC immediate decoding for RV32 Capability mode to produce imm[11], presumably from insn[30].
    Pro: provides access to the entire representable region of segments of any size. Simple access semantics.
    Con: Notable change from base ISA behavior. Restricts segment sizes to 1GiB.
  • Add a S2048PCC instruction, to cover the one missed case (access 2048 to 4096 bytes before pcc uses S2048PCC and is monotonic).
    Pro: Accesses the entire segment with minimal opcode space impact. No change to immediate encodings.
    Con: Brittle. Any change to the capability design which affects representable regions may require adding or redefining instructions.
  • Pad the beginning of segments in the problematic [2048,4095] size range with 1024 bytes of unused data, or enough to reach a safe size of 4096.
    Pro: No ISA changes. No relocation changes.
    Con: Padding rule may be difficult to express in some linker script languages. Wasted space could reach 33% if the distribution of segment sizes in an application peaks between 2048 and 4096, which could be common in fine-grained settings. Space usage can be optimized significantly laying out AUIPCC-referenced symbols at the end of the segment and checking the segment contents for PCREL_HI relocations that significantly overshoot the beginning before applying padding, but this would be a new cross-cutting feature that could approach linker relaxation in complexity.
@jonwoodruff
Copy link
Contributor

jonwoodruff commented Jan 26, 2024

Excellent description of the problem!
First: would this be resolved if we had a single instruction that simply wrote PCC to a register unchanged?
Second: do we have such an instruction? "JAL rd, 4" maybe?

I guess the different style of drumming PC-relative addresses for these small cases might be problematic where linkers are meant to come in behind and fix things up... I don't know if there is already any variety in relocation (not sure if this is the correct use of the word?) styles that might be required for reaching different distances.

@sorear
Copy link
Contributor Author

sorear commented Jan 27, 2024

would this be resolved if we had a single instruction that simply wrote PCC to a register unchanged?

We have such an instruction: auipcc with an immediate of 0. I don't see how this helps solve the "build PCC - 2100 in 2 instructions" problem.

I don't know if there is already any variety in relocation (not sure if this is the correct use of the word?) styles that might be required for reaching different distances.

A compiler creates the sequence of "instructions" to access a symbol. It does not know how far away the symbol is (although the -mcmodel option provides an upper bound), so it must use a sequence of "instructions" that will work for any sufficiently small displacement.

The linker rewrites the "instructions" into actual instructions once the symbol addresses are known. The rewriting process can change any bit in the "instructions" (don't get hung up over whether the mnemonic is the same, etc) but it cannot change the register usage, because register allocation and spilling was already done by the compiler.

On most architectures, the linker cannot change the size of the "instructions" during relocation processing, so the compiler also fixes the number of bytes used. RISC-V provides a "relaxation" feature which allows "instructions" to be replaced with shorter instructions. This process is only guaranteed to reach a fixed point if smaller (in absolute value) displacements always result in shorter instruction sequences. Relaxation provides small (1-2% total code size) benefits, and is complicated to support in the linker, so it is optional; all symbol references should be able to complete without it.

In RISC-V CHERI's default code model, all symbol references start as 8 bytes with a range of ±2GiB:

auipcc REG, 0; cincoffsetimm REG, REG, 0    Reference to the symbol with pcc's bounds
auipcc REG, 0; clc REG, 0(REG)    Load a global or GOT entry, also ld lw lwu lh lhu lb lbu
auipcc REG, 0; csc REG, 0(REG)    Store a global, immediate in a different place, also sd sw sh sb
auipcc REG, 0; jalr REG, REG, 0   Jump to a label, call or tail call a function

The last case can be relaxed to 4 byte jal REG, offset for ±1MiB or 2 byte c.j/c.jal OFFSET for ±2KiB, REG = x0 or (rv32 only) x1.

The base sequences do not work for a narrow range of offsets and pcc bounds, as described in the original issue. The first option changes the base sequence to work for all offsets, at a cost in range. The second option allows the linker to fix the sequence by changing

# Not monotonic, may transiently exceed the representable region
auipcc REG, -1
cincoffsetimm REG, REG, 1096 # -4096 + 1096 = -3000

to

# Monotonic, if the result is representable all intermediates are
s2048pcc REG
cincoffsetimm REG, REG, -952 # -2048 + -952 = -3000

Does this help? Can you elaborate on what you were proposing?

@sorear
Copy link
Contributor Author

sorear commented Jan 27, 2024

CHERIoT uses auipcc with 11-bit alignment (although they just shifted it instead of using insn[30]), so that might be a better approach than I initially considered.

Since CHERIoT requires strict W^X, auipcc cannot be used for writable data there; CHERIoT expects a cgp-based ABI, and provides an auicgp instruction with the same immediate as auipcc but using cgp (c3) as a base. Zcheri will likely be used in W^X and ROM-based systems where data access is predominantly through a pointer. We cannot provide an auicgp instruction without using encoding space reserved for custom extensions. The instruction sequences available in the base ISA for gp-relative data access are:

# 10 bytes, ±2GiB
lui TMP, hi20(symbol)
c.add TMP, TMP, gp
sw DATA, lo12(symbol)(TMP)

# 8 bytes, ±128KiB (+126 -130)
c.lui TMP, hi20(symbol)
c.add TMP, TMP, gp
sw DATA, lo12(symbol)(TMP)

# 4 bytes, ±2KiB
sw DATA, lo12(symbol)(gp)

Andes provides swgp (±512KiB, 4 insn bytes) as an extension; the code size reduction TG may eventually provide a similar instruction although with less encoding space. Zcheri allows the standard sequences, albeit with sizes of 12, 10, and 4 bytes due to non-compressibility of cincoffset. In principle gp-relative accesses have the same representability problem for gp length between 2-3 KiB, but careful choice of offset will always avoid this.

@jonwoodruff
Copy link
Contributor

jonwoodruff commented Jan 28, 2024

@sorear Thank you for your patient description and background!
I was briefly thinking that the problem was that AUIPC(C) was flooring the lower 12-bits, and thus introducing the representability issue. That was a me problem. I think I've got it now.

The simplest solution I can think of is to make IncOffsetImm rd, x0, #####; to decode to IncOffsetImmPCC. That is, take PCC as the source register. (You can always use ADDI to derive from null if you like.). This way, with two instructions, you can reach ±4k (whether the second instruction is a memory operation with a 12-bit immediate, or another IncOffsetImm).

Again, thanks for your patience!

@LawrenceEsswood
Copy link
Collaborator

Just to add to this the possible hardware solutions the last time this was discussed:

  • Add an extra implicit 1 bit to the capability length to double OOB representability.
    Pros: fixes the problem without requiring any software changes and segment is still +-2GB. Also, more OOB space to fix CHERI representability bugs.
    Cons: Wastes or bit, or else requires doubling alignment requirements.
  • Ensure bounds decoding exactly splits OOB representable space in half in the lower and upper address region.
    Pros: fixes the problem without requiring any software changes and segment is still +-2GB
    Cons: larger comparison in decode

To go into details on the second method: the exact way that the OOB space is distributed currently depends on a 3 (?) bit comparison. This results in at least a 1/4 of the OOB space being allocated to both the upper and lower region. I believe a mantissa sized comparison (so a dozen bits, rather than 3) would result in an exact halving and so provide exactly the amount of OOB space required to fix this problem.

@tariqkurd-repo tariqkurd-repo added the question Further information is requested label Jan 31, 2024
@jonwoodruff
Copy link
Contributor

jonwoodruff commented Feb 7, 2024

@LawrenceEsswood Thanks for the reminder!

After doing a bit more thinking about options here, along with @PeterRugg and @gameboo, I have the following opinion:

Most preferable option:
Lawrence's "edge = exactly 1/4 of the representable space below the base". That is, Edge = B - 10'b00.1000.0000 for CLEN=64. The downside of this option is that we will now need 10-bit comparisons to test if new mantissa values are in the representable space, potentially slowing down CIncOffset a little bit. It would be ideal if some ASIC folks could test this out and see if it affects them. @PeterRugg has a rig for testing logic functions in FPGA that we can use as well. @gameboo reminds us that a probably less optimal, but semantically nicer, alternative would be to always center the representable space around the object, which might involve something like Edge = (B + L/2) - 10'b01.0000.0000 to get the edge.

Second most preferable option:
The IncOffsetPCImm instruction that can grab PC and add a 12-bit immediate. This one would use the IncOffsetImm cd, c0, # encoding. If we were to go crazy with this, we could align the arithmetic at 8-bytes and get a ±16k range from PC in a single instruction.

@tariqkurd-repo tariqkurd-repo added bug Something isn't working documentation Improvements or additions to documentation and removed question Further information is requested labels Feb 13, 2024
@tariqkurd-repo
Copy link
Collaborator

Can someone prepare a PR for Lawrence's solution, so that we can analyse the exact implications on the hardware?

@PeterRugg
Copy link
Contributor

Sure: I'll have a go tomorrow.

@davidchisnall
Copy link

CHERIoT uses auipcc with 11-bit alignment (although they just shifted it instead of using insn[30]), so that might be a better approach than I initially considered.

To give a little bit of extra context, there are some extra constraints for CHERIoT here:

In the 64-bit CHERI capability encoding, you can always represent pointers 4 KiB out of bounds. This means that AUIPCC can always give you a pointer either above or below the target and you can adjust back in bounds. On a 32-bit system, that out-of-bounds space comes with a significant cost in representable bounds. The initial proposals for a 32-bit encoding come with far too much wasted memory from imprecise bounds for small embedded systems and so CHERIoT's encoding sacrifices out-of-bounds representability in exchange for increased precision. In particular, we don't guarantee any out-of-bounds representability for underflow.

As a result, we have to be able to guarantee that the displacement provided by AUIPCC is between the current PC value and the target. This requires making the AUIPCC and CIncOffset displacements overlap by one bit. We always round the AUIPCC immediate towards zero to ensure that both the result of AUIPCC and CIncOffset will remain in bounds.

Reducing the range of AUIPCC by a factor of two is not a problem for us, because anyone who actually has 2 GiB of code accessible via PCC is definitely not building an embedded system and is probably not someone who should be allowed to write software.

@PeterRugg
Copy link
Contributor

PeterRugg commented Feb 20, 2024

Just to follow up on the fix in #116, I've implemented it in a version of our bluespec library (CTSRD-CHERI/cheri-cap-lib@853c9c2) and done some early measurements, as well as checking briefly that the representable bounds change enough to fix the issue. incOffset and setAddr see minor timing drops: from about 350MHz on a Stratix 10 FPGA to 300MHz. Obviously to be taken with a heavy pinch of salt, but looks like it's non-negligible, but unlikely to be a huge problem.

tariqkurd-repo pushed a commit that referenced this issue Apr 11, 2024
This is Jon's proposal to change the specification according to
Lawrence's suggestion, addressing
#56.

Ideally we wouldn't merge it until we implement it, at least in our
cheri-cap-lib, to confirm it doesn't affect timing too much to do a
MW-bit comparison rather than a 3-bit comparison on these paths, and to
make sure it fixes the issue.

---------

Co-authored-by: Jonathan Woodruff <[email protected]>
@andresag01
Copy link
Collaborator

This was fixed in #116, so this ticket can now be closed

@davidchisnall
Copy link

from about 350MHz on a Stratix 10 FPGA to 300MHz. Obviously to be taken with a heavy pinch of salt, but looks like it's non-negligible, but unlikely to be a huge problem

A 14% drop in fMax is huge. Was this improved in later versions?

@PeterRugg
Copy link
Contributor

A 14% drop in fMax is huge. Was this improved in later versions?

Agreed, sorry, I think that comment was lacking the context to be useful. That was looking at the most-affected functions in isolation. This change shouldn't make those critical for us, as CSetBounds is down at 250MHz. Regardless, I was hoping that others with more constrained complete designs could have a go and confirm feasibility.

@davidchisnall
Copy link

It's not something that we can test, since the CHERIoT encoding is very different. We completely sacrifice most out-of-bounds encoding for more precision, which is important on memory-constrained devices. It is a shame if the official spec has to diverge, but not a huge problem for the compiler / linker since we already have separate relocation types due to a different compartment model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation For next release
Projects
None yet
Development

No branches or pull requests

7 participants