work in progress
Notes on candidate instructions for RV128 (unofficial).
A ‘candidate’ RISC-V 128-bit ISA is proposed for an ABI that can use either 64-bit or 128-bit pointers. A 64-bit pointer model (offsets from a base register) provides the benefits of the larger 128-bit integer types while controlling the size of application binaries.
A 64-bit near model is introduced and referred to as LP64FP128 (Long and Pointer 64-bit, Far Pointer 128-bit).
Present-day code already uses 128-bit, 256-bit and 512-bit registers for scientific code (e.g. DNA string search), multimedia, device security and many other applications. 2048-bit big integers are in everyday use in public-key cryptography, and 256-bit elliptic-curve cryptography and GF(2^128) Galois field arithmetic are in widespread use. Popular examples are the ECDHE key exchange and the AES128-GCM AEAD algorithm employed by Mozilla Firefox, Microsoft Edge, Internet Explorer, Apple Safari and Google Chrome. Numerous applications can benefit from the increased bandwidth offered by 128-bit integer types.
However, while large integer types are available today, the number of address bits implemented by typical processors remains in the order of 32-48 bits (virtual) and 32-44 bits (physical). Processors at the very high end may soon reach 57 bits, but for the next two decades a compact code model that uses 64-bit pointers in the common case is likely to be more practical on 128-bit CPUs, while not precluding a pure 128-bit pointer model.
With this in mind, an LP64FP128 model is proposed alongside a P128 model that uses the full 128-bit address space for pointers.
Instead of redefining existing C type sizes, as was done during the transition from ILP32 to LP64 and LLP64, these two proposals retain the current meaning of `long` and `long long` and introduce a new C type, `long long long`, that requires no new language keywords.
C / C++ types | ILP32 | LP64 | P128 | LP64FP128 |
---|---|---|---|---|
int | 32 | 32 | 32 | 32 |
long | 32 | 64 | 64 | 64 |
long long | 64 | 64 | 64 | 64 |
__int128 (GCC) | - | - | 128 | 128 |
long long long | - | - | 128 | 128 |
void* | 32 | 64 | 128 | 64 |
`void* __attribute__((near))` | 32 | 64 | 128 | 64 |
`void* __attribute__((far))` | 32 | 64 | 128 | 128 |
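GCC and Clang already provide `__int128` on 64-bit targets, so the `long long long` row can be approximated today. The following is a minimal sketch, assuming a GCC/Clang toolchain on an LP64 host. The typedef names `i128`/`u128` and the helpers `mul_64x64_128` and `split_u128` are hypothetical stand-ins for illustration, not part of any standard header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed `long long long` type.
   __int128 is a GCC/Clang extension on 64-bit targets. */
typedef __int128 i128;
typedef unsigned __int128 u128;

/* These hold on an LP64 host, matching the LP64 column of the table. */
_Static_assert(sizeof(long) == 8, "LP64: long is 64 bits");
_Static_assert(sizeof(long long) == 8, "long long is 64 bits");
_Static_assert(sizeof(void *) == 8, "LP64: pointers are 64 bits");
_Static_assert(sizeof(i128) == 16 && sizeof(u128) == 16, "128-bit integral is 16 bytes");

/* Full 64 x 64 -> 128-bit unsigned product. On RV64 this needs a MUL/MULHU
   pair; with 128-bit registers a single 128-bit MUL suffices. */
static u128 mul_64x64_128(uint64_t a, uint64_t b)
{
    return (u128)a * b;
}

/* Split a 128-bit value into its high and low 64-bit halves. */
static void split_u128(u128 v, uint64_t *hi, uint64_t *lo)
{
    assert(hi != NULL && lo != NULL);
    *hi = (uint64_t)(v >> 64);
    *lo = (uint64_t)v;
}
```

For example, squaring `UINT64_MAX` with `mul_64x64_128` and splitting the result gives a high half of `0xFFFFFFFFFFFFFFFE` and a low half of `1`.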
- riscv32 is ILP32
- riscv64 is LP64 (code model is 32-bits, data model is 64-bits)
- riscv128 is LP64FP128 (code model is 32-bits, data model is 128-bits with 64-bit base register offsets)
- `int128_t` and `uint128_t` types for `<cstdint>` and `<stdint.h>`
- With P128: `sizeof(uintptr_t) == sizeof(uintmax_t)` (*2)
- With LP64FP128: `sizeof(uintptr_t) != sizeof(uintmax_t)` (*2)
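Under LP64FP128 a near pointer is a 64-bit offset from a base register, while a far pointer carries the full 128-bit address. The sketch below models that relationship in C. The names `far_addr`, `near_to_far` and `far_to_near` are hypothetical. Zero-extending the offset is an assumption, because the notes do not say whether near offsets are signed or unsigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a far pointer: a full 128-bit address
   (unsigned __int128 is a GCC/Clang extension). */
typedef unsigned __int128 far_addr;

/* A near pointer is a 64-bit offset from a base register such as gp, tp or sp.
   Assumption: the offset is zero-extended before it is added to the base. */
static far_addr near_to_far(far_addr base, uint64_t near_off)
{
    return base + (far_addr)near_off;
}

/* Convert a far address back to a near offset. This succeeds only when the
   address lies in the 2^64-byte window that starts at base. */
static bool far_to_near(far_addr base, far_addr addr, uint64_t *near_off)
{
    assert(near_off != NULL);
    if (addr < base || addr - base > (far_addr)UINT64_MAX)
        return false;
    *near_off = (uint64_t)(addr - base);
    return true;
}
```

In this model a near pointer occupies 8 bytes and a far pointer 16 bytes, matching the LP64FP128 column for `void*` with the `near` and `far` attributes.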
Most of the RV128 instruction set is a simple translation of the RV32 and RV64 models. However, the RV128I load and store instructions need special consideration because of the limited OP-LOAD encoding space, and because the RISC-V atomics model is not strictly compatible with the C11 atomics model. The current OP-LOAD major opcode has only one free instruction slot in the I-type encoding (12-bit offset), so an alternative encoding is considered.
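For reference, the RISC-V unprivileged specification recommends mapping C11 atomic loads and stores onto plain loads and stores bracketed by FENCE instructions. The sketch below records that mapping as the `pred`/`succ` sets each access would need. This is the information a load/store `flags` field would have to carry. The struct and function names are hypothetical. `memory_order_consume` is treated as acquire, as the standard mapping does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* FENCE predecessor/successor set bits, in FENCE encoding order (I, O, R, W). */
enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 };
enum { FENCE_RW = FENCE_R | FENCE_W };

/* A fence is a (pred, succ) pair; pred == 0 means "no fence". */
struct fence { uint8_t pred, succ; };

/* Hypothetical summary of what a load/store "flags" field would need to
   express: an optional fence before the access and one after it. */
struct access_fences { struct fence before, after; };

/* Recommended RISC-V mapping for C11 atomic loads. */
static struct access_fences load_fences(memory_order mo)
{
    struct access_fences f = {{0, 0}, {0, 0}};
    switch (mo) {
    case memory_order_relaxed:
        break;
    case memory_order_consume:  /* the standard mapping strengthens consume to acquire */
    case memory_order_acquire:
        f.after = (struct fence){FENCE_R, FENCE_RW};   /* l; fence r,rw */
        break;
    case memory_order_seq_cst:
        f.before = (struct fence){FENCE_RW, FENCE_RW}; /* fence rw,rw; l */
        f.after = (struct fence){FENCE_R, FENCE_RW};   /* l; fence r,rw */
        break;
    default:
        assert(!"release and acq_rel are not valid orders for a load");
    }
    return f;
}

/* Recommended RISC-V mapping for C11 atomic stores. */
static struct access_fences store_fences(memory_order mo)
{
    struct access_fences f = {{0, 0}, {0, 0}};
    switch (mo) {
    case memory_order_relaxed:
        break;
    case memory_order_release:
    case memory_order_seq_cst:
        f.before = (struct fence){FENCE_RW, FENCE_W};  /* fence rw,w; s */
        break;
    default:
        assert(!"consume, acquire and acq_rel are not valid orders for a store");
    }
    return f;
}
```

The `consume` case is where flags derived from FENCE `pred`/`succ` could go further than this mapping, by expressing consume directly instead of falling back to acquire.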
Instruction | Opcode |
---|---|
ADDD | OP-64 |
SUBD | OP-64 |
SLLD | OP-64 |
SRLD | OP-64 |
SRAD | OP-64 |
ADDID | OP-IMM-64 |
SLLID | OP-IMM-64 |
SRLID | OP-IMM-64 |
SRAID | OP-IMM-64 |
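The RV128I `D` instructions follow the pattern of the RV64I `W` instructions. They operate on the low 64 bits of their operands and sign-extend the 64-bit result to 128 bits. Below is a minimal C model of their semantics. It assumes GCC/Clang `unsigned __int128` and two's-complement conversions. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit integer register (unsigned __int128 is a GCC/Clang extension). */
typedef unsigned __int128 xreg;

/* Sign-extend a 64-bit result to 128 bits. The signed conversions assume
   two's complement with modulo wrap-around, as GCC and Clang provide. */
static xreg sext64(uint64_t v)
{
    return (xreg)(__int128)(int64_t)v;
}

/* ADDD / SUBD: operate on the low 64 bits, sign-extend the result. */
static xreg addd(xreg rs1, xreg rs2) { return sext64((uint64_t)rs1 + (uint64_t)rs2); }
static xreg subd(xreg rs1, xreg rs2) { return sext64((uint64_t)rs1 - (uint64_t)rs2); }

/* ADDID: 12-bit signed immediate, sign-extended before the 64-bit add. */
static xreg addid(xreg rs1, int32_t imm12)
{
    assert(imm12 >= -2048 && imm12 <= 2047);
    return sext64((uint64_t)rs1 + (uint64_t)(int64_t)imm12);
}

/* SLLD / SRLD / SRAD: the shift amount is the low 6 bits of rs2. */
static xreg slld(xreg rs1, xreg rs2) { return sext64((uint64_t)rs1 << (unsigned)(rs2 & 63)); }
static xreg srld(xreg rs1, xreg rs2) { return sext64((uint64_t)rs1 >> (unsigned)(rs2 & 63)); }
static xreg srad(xreg rs1, xreg rs2)
{
    /* >> on a negative int64_t is an arithmetic shift on GCC and Clang. */
    int64_t v = (int64_t)(uint64_t)rs1;
    return sext64((uint64_t)(v >> (unsigned)(rs2 & 63)));
}
```

The upper 64 bits of the inputs are ignored. For example, `ADDD` of `INT64_MAX` and `1` wraps within 64 bits and yields the 128-bit value `-2^63`.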
Instruction | Opcode |
---|---|
MULD | OP-64 |
DIVD | OP-64 |
DIVUD | OP-64 |
REMD | OP-64 |
REMUD | OP-64 |
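The M-extension `D` forms follow the same pattern. They carry over the RISC-V rules that DIVW/REMW already use for division by zero and for signed overflow (most-negative value divided by -1), where C itself leaves the result undefined. The following is a sketch under the same GCC/Clang assumptions. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 xreg;   /* one 128-bit integer register */

/* Sign-extend a 64-bit result to 128 bits (assumes GCC/Clang conversions). */
static xreg sext64(uint64_t v) { return (xreg)(__int128)(int64_t)v; }

/* Low 64 bits of a register, viewed as signed or unsigned. */
static int64_t lo_s(xreg r) { return (int64_t)(uint64_t)r; }
static uint64_t lo_u(xreg r) { return (uint64_t)r; }

/* MULD: low 64 bits of the 64 x 64 product, sign-extended. */
static xreg muld(xreg a, xreg b) { return sext64(lo_u(a) * lo_u(b)); }

/* DIVD / REMD: signed 64-bit division with RISC-V edge-case results. */
static xreg divd(xreg a, xreg b)
{
    int64_t x = lo_s(a), y = lo_s(b);
    if (y == 0) return sext64(UINT64_MAX);                     /* divide by zero: -1 */
    if (x == INT64_MIN && y == -1) return sext64((uint64_t)x); /* overflow: dividend */
    return sext64((uint64_t)(x / y));
}

static xreg remd(xreg a, xreg b)
{
    int64_t x = lo_s(a), y = lo_s(b);
    if (y == 0) return sext64((uint64_t)x);                    /* divide by zero: dividend */
    if (x == INT64_MIN && y == -1) return 0;                    /* overflow: 0 */
    return sext64((uint64_t)(x % y));
}

/* DIVUD / REMUD: unsigned 64-bit division; results are still sign-extended. */
static xreg divud(xreg a, xreg b)
{
    uint64_t x = lo_u(a), y = lo_u(b);
    return sext64(y == 0 ? UINT64_MAX : x / y);                /* divide by zero: all ones */
}

static xreg remud(xreg a, xreg b)
{
    uint64_t x = lo_u(a), y = lo_u(b);
    return sext64(y == 0 ? x : x % y);                          /* divide by zero: dividend */
}
```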
Instruction | Opcode | Encoding |
---|---|---|
LR.Q | OP-AMO | funct3[14:12]=Q.width |
SC.Q | OP-AMO | funct3[14:12]=Q.width |
AMO{*}.Q | OP-AMO | funct3[14:12]=Q.width |
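The existing AMO format (funct5, aq, rl, rs2, rs1, funct3, rd, opcode `0101111`) already has a width field, so 128-bit atomics only need a new `funct3` value. The encoder below assumes Q takes `funct3 = 100`. That follows W (`010`) and D (`011`) and matches the width FLQ/FSQ already use. The encoder itself is an illustrative sketch with hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { OPC_AMO = 0x2F };                              /* major opcode 0101111 */
enum amo_width { AMO_W = 2, AMO_D = 3, AMO_Q = 4 };   /* funct3[14:12]; Q assumed */
enum amo_op {                                         /* funct5[31:27] */
    AMO_ADD = 0x00, AMO_SWAP = 0x01, AMO_LR = 0x02, AMO_SC = 0x03,
    AMO_XOR = 0x04, AMO_OR = 0x08, AMO_AND = 0x0C,
    AMO_MIN = 0x10, AMO_MAX = 0x14, AMO_MINU = 0x18, AMO_MAXU = 0x1C
};

/* Encode funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode. */
static uint32_t encode_amo(enum amo_op op, enum amo_width w, int aq, int rl,
                           unsigned rd, unsigned rs1, unsigned rs2)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    assert(op != AMO_LR || rs2 == 0);                 /* LR has no rs2 operand */
    return ((uint32_t)op << 27) | ((uint32_t)(aq != 0) << 26) |
           ((uint32_t)(rl != 0) << 25) | ((uint32_t)rs2 << 20) |
           ((uint32_t)rs1 << 15) | ((uint32_t)w << 12) |
           ((uint32_t)rd << 7) | OPC_AMO;
}
```

Only bits 14:12 differ between the W, D and Q forms of the same operation, which is why no new major opcode is needed for 128-bit atomics.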
Instruction | Opcode | Encoding |
---|---|---|
FLQ | OP-LOAD-FP | funct3[14:12]=Q.width |
FSQ | OP-STORE-FP | funct3[14:12]=Q.width |
F{*}.Q | OP-FP | fmt[26:25]=Q.fmt |
Instruction | Opcode | Encoding |
---|---|---|
FCVT.{F,D,Q}.C | OP-FP | funct5[31:27]={{F,D,Q}.C}, fmt[26:25]={S,D,Q} |
FCVT.{F,D,Q}.CU | OP-FP | funct5[31:27]={{F,D,Q}.CU}, fmt[26:25]={S,D,Q} |
FCVT.C.{F,D,Q} | OP-FP | funct5[31:27]={C.{F,D,Q}}, fmt[26:25]={S,D,Q} |
FCVT.CU.{F,D,Q} | OP-FP | funct5[31:27]={CU.{F,D,Q}}, fmt[26:25]={S,D,Q} |
Because the proposed 128-bit ABI will make heavy use of register-register loads and stores, a register-register encoding similar to the AMO encoding is proposed. It contains slots for 3 registers and 7 function bits instead of 2 registers and 12 immediate bits.
One of the main concerns raised about a register-register encoding for loads and stores is that stores require 3 register read ports (base, offset and data). Fused multiply-add also requires 3 read ports, and the additional read port may be justifiable on larger RV128 designs.
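The AMO-like register-register layout can be sketched as an encoder. Everything in this sketch is hypothetical. The major opcode value is a placeholder chosen only for illustration, since the notes have not settled on an opcode. The split of the 7 function bits into a 5-bit operation code and a 2-bit flags field (mirroring AMO funct5 and aq/rl) is an assumption. Stores are assumed to carry their data register `rs3` in the `rd` slot (bits 11:7), which is one way to alias the register slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL placeholder for the major opcode; illustration only. */
#define OPC_LDST_RR 0x7Bu

/* Width in funct3[14:12], assumed to follow the AMO/FP widths. */
enum rr_width { RR_W = 2, RR_D = 3, RR_Q = 4 };

/* Hypothetical AMO-like layout:
     op[31:27] flags[26:25] rs2[24:20] rs1[19:15] width[14:12] rd/rs3[11:7] opcode[6:0]
   A load writes rd; a store reads its data register rs3 from the rd slot. */
static uint32_t encode_ldst_rr(unsigned op, unsigned flags, enum rr_width w,
                               unsigned rd_or_rs3, unsigned rs1, unsigned rs2)
{
    assert(op < 32 && flags < 4);
    assert(rd_or_rs3 < 32 && rs1 < 32 && rs2 < 32);
    return ((uint32_t)op << 27) | ((uint32_t)flags << 25) |
           ((uint32_t)rs2 << 20) | ((uint32_t)rs1 << 15) |
           ((uint32_t)w << 12) | ((uint32_t)rd_or_rs3 << 7) | OPC_LDST_RR;
}

/* Integer register-file read ports needed by one access: a load reads the
   base (rs1) and offset (rs2); a store also reads the data register (rs3). */
static unsigned ldst_rr_read_ports(bool is_store)
{
    return is_store ? 3u : 2u;
}
```

Placing `rs3` in the `rd` slot keeps every register field at the same bit position for loads and stores. The cost is the third read port for stores discussed above.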
The following table shows Word, Double Word and Quad Word register-relative load and store instructions.
Opcode | Operands | Notes |
---|---|---|
LW | flags rd,rs2(rs1) | |
LWU | flags rd,rs2(rs1) | |
LD | flags rd,rs2(rs1) | |
LDU | flags rd,rs2(rs1) | |
LQ | flags rd,rs2(rs1) | |
LQU | flags rd,rs2(rs1) | |
LX | flags rd,rs2(rs1) | can set metadata address bit |
SW | flags rs3,rs2(rs1) | |
SD | flags rs3,rs2(rs1) | |
SQ | flags rs3,rs2(rs1) | |
SX | flags rs3,rs2(rs1) | can set metadata address bit |
AMOCMPSWAP.{w,d,q} | flags rd,rs2,(rs1) | |
The following table shows Byte and Half register-relative load and store instructions.
- Byte and Half instructions may not be supported, depending on the AMO width encoding.
- Consider far strings (byte and half register-relative instructions).
Opcode | Operands |
---|---|
LB | flags rd,rs2(rs1) |
LBU | flags rd,rs2(rs1) |
LH | flags rd,rs2(rs1) |
LHU | flags rd,rs2(rs1) |
SB | flags rs3,rs2(rs1) |
SH | flags rs3,rs2(rs1) |
- Layout equivalent to AMO: the funct5 encoding space and a 2-bit width encoding are available for the opcode.
- Represent C11 atomics using memory-ordering flags taken from the FENCE pred,succ fields, which allows consume to be expressed, unlike aq/rl alone.
- Load prefetch temporal hint to preload data into the cache, with rd = x0.
- Load and store read-around and write-around temporal hints (cache bypass).
- LX and SX are defined as duplicates of the XLEN-length loads and stores. They are XLEN loads and stores of addresses rather than integers (LW, LD, LQ); although functionally equivalent, they do not specify the width being loaded and instead load the native width of the current operating mode. This distinction also allows static analysis to identify pointer-sized loads.
- Potentially use two major opcodes to un-alias the register slots in bits 31:27 and 11:7 and increase the encoding space.
- Decide between using OP-LOAD-STORE-128 or OP-LOAD-128 and OP-STORE-128.
- Increased encoding space would allow code and data pointer metadata bits for LX and SX.
- There is precedent for aliasing register slots: the RVC compressed encodings alias rd/rs1.
- Flags could optionally indicate metadata, e.g. an "address group bit" similar in concept to extended internal state in the FPU register file, or the metadata could be set implicitly by LX and SX.
- A separate instruction for forming addresses allows an implementation to treat addresses as a separate group and invalidate the metadata address bit on operations that are not valid on the address group, e.g. `address + offset` is valid while `address + address` is invalid. Implementations can load offsets into the lower bits of general purpose registers and use them with register-register loads and stores against a base register such as `gp`, `tp` or `sp`.
- X-only code can use immediates to form addresses with far offsets, with the address prefix stored in code that is not accessible to memory read primitives.
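The address-group rule can be modelled as a register file in which each register carries one extra metadata bit. In this sketch all names are hypothetical. `lx` sets the bit, `ld_int` (standing in for LW/LD/LQ) clears it, and `add` keeps the bit only when exactly one operand is an address. As a result `address + offset` stays an address, while `address + address` does not. This illustrates the rule described in the notes, not any specified hardware behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register value tagged with an "address group" metadata bit. */
struct tagged_reg {
    uint64_t value;
    bool is_addr;
};

/* LX: load a pointer-sized value; the result is marked as an address. */
static struct tagged_reg lx(uint64_t loaded) { return (struct tagged_reg){loaded, true}; }

/* Integer load (LW/LD/LQ): the result is never marked as an address. */
static struct tagged_reg ld_int(uint64_t loaded) { return (struct tagged_reg){loaded, false}; }

/* ADD: the result is an address only if exactly one operand is an address
   (address + offset); address + address and offset + offset are not. */
static struct tagged_reg add(struct tagged_reg a, struct tagged_reg b)
{
    struct tagged_reg r = {a.value + b.value, a.is_addr != b.is_addr};
    return r;
}

/* A load or store through rs2(rs1) is permitted only if the base register
   rs1 still carries a valid address tag. */
static bool can_use_as_base(struct tagged_reg base) { return base.is_addr; }
```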