
Instruction Encoding Notation

AndyGlew edited this page Aug 13, 2020 · 2 revisions

Instruction set design requires notations for instruction encodings.

Glew legacy instruction set encoding bitstring notation

For many years (since 1983 or thereabouts), I have been using notation such as that below for instruction encoding bit strings:

PREFETCH.64B.R: imm12.rs1:5.110.rd=00000.0010011, e.g. ORI with RD=x0

In these bitstrings

  • 0 and 1 correspond to bit values

  • fields are specified by rs2:5, rd=00000, etc

    • i.e. fieldname:width
    • e.g. fieldname=value (width implied)
  • punctuation is used to improve readability, such as period ".", underscore "_", and comma ","

Bit positions are numbered ... well, for x86 with bit 0 on the left, for RISC-V with bit 0 on the right. Whatever is appropriate for your ISA.
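As an illustration of the basic notation, here is a minimal sketch of a parser that turns such a bitstring into a mask/match pair plus field positions. The function name `parse_encoding` and the return shape are hypothetical, not taken from any existing tool. It assumes RISC-V style numbering (bit 0 on the right), fixed values written in binary, and macros already expanded (so `imm12` is written `imm:12`).

```python
import re

def parse_encoding(pattern):
    """Return (mask, match, fields, width); fields maps name -> (msb, lsb).

    Hypothetical helper: assumes bit 0 is the rightmost bit of the string.
    """
    # Period, underscore, comma and blank are readability punctuation only.
    tokens = [t for t in re.split(r"[._, ]+", pattern) if t]
    parsed = []  # (name or None, width, fixed value or None), leftmost first
    for tok in tokens:
        if re.fullmatch(r"[01]+", tok):                          # literal bits
            parsed.append((None, len(tok), int(tok, 2)))
        elif m := re.fullmatch(r"([A-Za-z]\w*):(\d+)", tok):     # fieldname:width
            parsed.append((m[1], int(m[2]), None))
        elif m := re.fullmatch(r"([A-Za-z]\w*)=([01]+)", tok):   # fieldname=value
            parsed.append((m[1], len(m[2]), int(m[2], 2)))
        else:
            raise ValueError(f"unrecognized token: {tok!r}")

    width = sum(w for _, w, _ in parsed)
    mask = match = 0
    fields = {}
    msb = width - 1
    for name, w, value in parsed:
        lsb = msb - w + 1
        if value is not None:            # fixed bits contribute to mask/match
            mask |= ((1 << w) - 1) << lsb
            match |= value << lsb
        if name is not None:
            fields[name] = (msb, lsb)
        msb = lsb - 1
    return mask, match, fields, width

mask, match, fields, width = parse_encoding("imm:12.rs1:5.110.rd=00000.0010011")
print(hex(mask), hex(match), fields["rs1"], width)
```

A word `w` then matches the encoding when `(w & mask) == match`.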

Notation Extensions

Non-Binary

Strings of 0s and 1s encode their own length, but can be long.

Hex and octal numbers encode their own lengths only if the field width is a multiple of 4 or 3 bits, respectively.

fieldname:length=nonBinaryNumber, e.g. opex:5=0x1F <=> opex=11111

Handles hex, octal, decimal, and whatever other formats you use: 0x[0-9A-F]+, H[0-9a-f]+, 0[0-7]+, [0-9]+ for decimal, and so on.
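A minimal sketch of how the non-binary form might be normalized back to the binary form. The helper name `field_bits` is hypothetical; it assumes the four number formats listed above and rejects values that do not fit in the stated width.

```python
import re

def field_bits(spec):
    """Rewrite 'name:width=number' as 'name=bits' with exactly `width` bits."""
    m = re.fullmatch(r"(\w+):(\d+)=(\w+)", spec)
    if not m:
        raise ValueError(f"expected name:width=number, got {spec!r}")
    name, width, text = m[1], int(m[2]), m[3]

    if re.fullmatch(r"0[xX][0-9A-Fa-f]+", text):    # 0x... hex
        value = int(text[2:], 16)
    elif re.fullmatch(r"[Hh][0-9A-Fa-f]+", text):   # H... hex
        value = int(text[1:], 16)
    elif re.fullmatch(r"0[0-7]+", text):            # leading-zero octal
        value = int(text, 8)
    elif re.fullmatch(r"[0-9]+", text):             # decimal
        value = int(text, 10)
    else:
        raise ValueError(f"unrecognized number: {text!r}")

    if value >= (1 << width):
        raise ValueError(f"{text} does not fit in {width} bits")
    return f"{name}={value:0{width}b}"

print(field_bits("opex:5=0x1F"))   # opex=11111
```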

Macros

I usually use this in combination with a macro preprocessor, originally CPP, so that common patterns like rd:5 can be abbreviated. Occasionally I have written a customized macro preprocessor to handle notation extensions such as...

Alternatives and Coordinated alternatives

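  • e.g. {op:ADD,SUB} opcode={op:00,01}.rd:5.rs2:5.{regOrImm=0.rs1:5.opex:14=0,regOrImm=1.imm:14}
  • expands to
    • ADD opcode=00.rd:5.rs2:5.regOrImm=0.rs1:5.opex=00000000000000
    • ADD opcode=00.rd:5.rs2:5.regOrImm=1.imm:14
    • SUB opcode=01.rd:5.rs2:5.regOrImm=0.rs1:5.opex=00000000000000
    • SUB opcode=01.rd:5.rs2:5.regOrImm=1.imm:14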

Basically, {a,b}{x,y} is a Cartesian product, as in C-shell brace expansion: ax ay bx by

And alternatives can be coordinated by labels, so that {L:a,b}{x,y}{L:A,B} => axA ayA bxB byB

These notations are useful when the encodings do not form Boolean cubes - when there are holes cut out.
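The expansion rule can be sketched as follows. This is an illustrative implementation, not the preprocessor described above: the function name `expand` is hypothetical, nested braces are not handled, and a group is treated as labeled only when it begins with an identifier immediately followed by a colon (so a group whose first alternative looks like `rd:5` would be misread as a label).

```python
import itertools
import re

GROUP = re.compile(r"\{([^{}]*)\}")
LABEL = re.compile(r"([A-Za-z_]\w*):(.*)", re.S)

def expand(pattern):
    """Expand {a,b} alternatives; groups sharing a label vary together."""
    parts = []   # (label or None, alternatives); literals have one alternative
    pos = 0
    for m in GROUP.finditer(pattern):
        if m.start() > pos:
            parts.append((None, [pattern[pos:m.start()]]))
        body = m.group(1)
        lm = LABEL.fullmatch(body)
        label, body = (lm.group(1), lm.group(2)) if lm else (None, body)
        parts.append((label, body.split(",")))
        pos = m.end()
    if pos < len(pattern):
        parts.append((None, [pattern[pos:]]))

    # One independent choice axis per unlabeled group, one per distinct label.
    axes, sizes = [], {}
    for i, (label, alts) in enumerate(parts):
        key = label if label is not None else i
        if key in sizes:
            if sizes[key] != len(alts):
                raise ValueError(f"label {key!r} used with differing counts")
        else:
            sizes[key] = len(alts)
            axes.append(key)

    results = []
    for combo in itertools.product(*(range(sizes[k]) for k in axes)):
        pick = dict(zip(axes, combo))
        results.append("".join(
            alts[pick[label if label is not None else i]]
            for i, (label, alts) in enumerate(parts)))
    return results

print(expand("{a,b}{x,y}"))            # ['ax', 'ay', 'bx', 'by']
print(expand("{L:a,b}{x,y}{L:A,B}"))   # ['axA', 'ayA', 'bxB', 'byB']
```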

Expressions and Constraints

E.g. CMO.UR 000.rd:5.rs2:5.rs1:5 with rd==rs1 and rd!=0

E.g. ADD 010.rd.rs2.rs1 with rs2 > rs1 since commutative, allowing rs2 <= rs1 to be used for other instruction encodings.

(E.g. see MIPSr6, where we were very tight on encoding space.)
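One way such constraints might be checked at decode time is to attach a predicate over the extracted field values to each encoding. The sketch below is an assumption about representation, not an existing tool: the dictionary layout, `make_word`, and `matches` are hypothetical, and the 18-bit layout simply follows the CMO.UR pattern above with bit 0 on the right.

```python
def extract(word, msb, lsb):
    """Return bits msb..lsb of word (bit 0 is the least significant bit)."""
    return (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)

# Hypothetical description of "CMO.UR 000.rd:5.rs2:5.rs1:5 with rd==rs1 and rd!=0".
CMO_UR = {
    "mask": 0b111 << 15,
    "match": 0b000 << 15,
    "fields": {"rd": (14, 10), "rs2": (9, 5), "rs1": (4, 0)},
    "constraint": lambda f: f["rd"] == f["rs1"] and f["rd"] != 0,
}

def make_word(rd, rs2, rs1, opcode=0b000):
    """Assemble an 18-bit word in the layout above (for illustration only)."""
    return (opcode << 15) | (rd << 10) | (rs2 << 5) | rs1

def matches(enc, word):
    """True if the fixed bits match and the field constraint holds."""
    if (word & enc["mask"]) != enc["match"]:
        return False
    values = {name: extract(word, msb, lsb)
              for name, (msb, lsb) in enc["fields"].items()}
    return enc["constraint"](values)

print(matches(CMO_UR, make_word(rd=3, rs2=7, rs1=3)))   # True
print(matches(CMO_UR, make_word(rd=0, rs2=7, rs1=0)))   # False: rd == 0
```

Words that fail the constraint remain free for other instruction encodings.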

Probabilities

Either inline, or via separate metadata (e.g. instruction profiles) to guide optimization.

Other State

The bitstring patterns do not need to be restricted to bits from the instruction stream. They can include CSR (Control and Status Register) mode bits.

Register and Other non-Instruction Encodings

These notations can be used to describe CSR encodings, ...

Historical Alternative Notations

From time to time a pleasant alternative notation is reinvented:

  • ADD 0000ddddaaaabbbb
  • with spaces: ADD 0000 dddd aaaa bbbb
  • associated with pseudocode rd := ra + rb

Basically, the register name rX corresponds to the bits that contain X in the bitstring.

This is especially nice when tables of appropriate width can be created.
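The two styles carry the same information, so one can be converted to the other mechanically. A minimal sketch, assuming every letter X names a register field rX and that each field occupies one contiguous run of letters (the function name `letters_to_fields` is hypothetical):

```python
from itertools import groupby

def letters_to_fields(bits):
    """Convert e.g. '0000 dddd aaaa bbbb' to '0000.rd:4.ra:4.rb:4'."""
    bits = bits.replace(" ", "")
    out = []
    # Runs of 0/1 stay literal; a run of one letter X becomes field rX:width.
    for key, run in groupby(bits, key=lambda c: "01" if c in "01" else c):
        run = "".join(run)
        out.append(run if key == "01" else f"r{key}:{len(run)}")
    return ".".join(out)

print(letters_to_fields("0000 dddd aaaa bbbb"))   # 0000.rd:4.ra:4.rb:4
```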

Compare to equivalent in my now more common notation

  • ADD 00.rd:5.ra:5.rb:5
  • or with macros ADD 00.rd.ra.rb

I tend to use the latter notation more often nowadays because it scales better:

  • fieldnames can be more than a single letter
  • punctuation improves readability - and if you use the 00.ddd notation, punctuation destroys alignment if formats are variable
  • fieldnames like r0 and r1 are supported

RISC-V instruction encoding metadata

Elsewhere in the RISC-V toolchain a similar notation is used, with additions such as allowing blanks to separate fields, and allowing fields to be specified out of order by specifying bit positions such as rd=5..9.

AW: https://github.com/riscv/riscv-opcodes is where the current instruction encoding metadata lives. See comment at the top of https://github.com/riscv/riscv-opcodes/blob/master/opcodes-rvv for description of notation.
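A positional notation like this reduces to the same mask/match pair. The sketch below assumes blank-separated tokens where fixed bits are written `hi..lo=value` (or `bit=value`) and any other token is an operand name, as in a line such as `addi rd rs1 imm12 14..12=0 6..2=0x04 1..0=3`. That line format is an assumption here, so check the comment referenced above for the authoritative description; `parse_positional` is a hypothetical name.

```python
def parse_positional(line):
    """Return (name, operands, mask, match) for one positional encoding line.

    Assumed token forms: 'hi..lo=value' or 'bit=value' for fixed bits,
    anything else is an operand (variable field) name.
    """
    name, *tokens = line.split()
    mask = match = 0
    operands = []
    for tok in tokens:
        if "=" not in tok:
            operands.append(tok)
            continue
        span, value_text = tok.split("=", 1)
        hi_text, _, lo_text = span.partition("..")
        hi = int(hi_text)
        lo = int(lo_text) if lo_text else hi
        width = hi - lo + 1
        value = int(value_text, 0)        # accepts 3, 0x04, 0b101, ...
        if width <= 0 or value >= (1 << width):
            raise ValueError(f"bad fixed-bit token: {tok!r}")
        mask |= ((1 << width) - 1) << lo
        match |= value << lo
    return name, operands, mask, match

name, operands, mask, match = parse_positional(
    "addi rd rs1 imm12 14..12=0 6..2=0x04 1..0=3")
print(name, operands, hex(mask), hex(match))
```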

Wishlist

TBD: unify these notations.

Centralized vs Distributed

The RISC-V and Glew formats above have the advantage that they are decentralized.

Compare this to hand-written instruction decoders, where handwritten code first decodes bits 0 to 5, second ..., i.e. where the knowledge of the instruction encodings is centralized in a single place.

The decentralized formats have the advantage that they can be used to generate centralized formats. You can quickly add instructions that violate the assumptions of the centralized decoder representation, and let the tools regenerate the decoders.
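To make the point concrete, here is a minimal sketch of generating a centralized dispatch table from a flat, decentralized list. Each encoding is a (name, mask, match) tuple, where a word `w` matches when `(w & mask) == match`. The function names are hypothetical and the three RISC-V entries are included only as sample data.

```python
from collections import defaultdict

# Decentralized description: one independent entry per instruction.
ENCODINGS = [
    ("ADDI", 0x707F, 0x0013),
    ("ORI",  0x707F, 0x6013),
    ("LW",   0x707F, 0x2003),
]

def build_dispatch(encodings, msb, lsb):
    """Generate a first-level dispatch table keyed on bits msb..lsb."""
    field = ((1 << (msb - lsb + 1)) - 1) << lsb
    table = defaultdict(list)
    for name, mask, match in encodings:
        if (mask & field) != field:
            raise ValueError(f"{name} does not fix all of bits {msb}..{lsb}")
        table[(match & field) >> lsb].append((name, mask, match))
    return dict(table)

def decode(table, msb, lsb, word):
    """Look up the dispatch bucket, then test the remaining fixed bits."""
    key = (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)
    for name, mask, match in table.get(key, []):
        if (word & mask) == match:
            return name
    return None

table = build_dispatch(ENCODINGS, 6, 0)   # dispatch on the 7-bit opcode
print(sorted(table))                      # [3, 19]
print(decode(table, 6, 0, 0x6293))        # ORI
```

Adding an instruction means appending one entry and rebuilding the table; nothing in the decoder itself is edited by hand.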

GLEW OPINION: once you have such tools you never want to go back.

Tools

Many of us have written tools, taking in such notations and generating

  • instruction decoders for logic synthesis, simulators, back handlers and disassemblers
    • often in conjunction with systems for specifying instruction semantics
  • assemblers translating strings to machine code
  • RIT (Random Instruction Test) generators
  • pretty printers and diagrams
    • e.g. per-instruction encoding bitfield diagrams
    • e.g. "opcode maps" - the sort of hierarchical Karnaugh-map like tables that are in so many manuals.
  • consistency checkers - e.g. no overlap
  • encoding allocators and logic minimizers
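As an example of the consistency-checker item, two encodings overlap exactly when their fixed bits agree wherever both fix a bit. A minimal sketch, assuming each encoding is a (name, mask, match) tuple and ignoring any extra field constraints (the function names are hypothetical; the sample values include the ORI with rd=x0 pattern from the top of this page):

```python
from itertools import combinations

def overlaps(a, b):
    """True if some word matches both (name, mask, match) encodings."""
    (_, mask_a, match_a), (_, mask_b, match_b) = a, b
    common = mask_a & mask_b            # bits that both encodings fix
    return (match_a & common) == (match_b & common)

def find_overlaps(encodings):
    """Return the name pairs of all overlapping encodings."""
    return [(a[0], b[0])
            for a, b in combinations(encodings, 2)
            if overlaps(a, b)]

ENCODINGS = [
    ("ORI",            0x707F, 0x6013),
    ("PREFETCH.64B.R", 0x7FFF, 0x6013),   # ORI with rd=00000
    ("ADDI",           0x707F, 0x0013),
]
print(find_overlaps(ENCODINGS))   # [('ORI', 'PREFETCH.64B.R')]
```

An overlap is not necessarily an error, as the PREFETCH example shows, but a checker should at least report it so that the intended priority is explicit.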

Unfortunately for me, I have left most of the tools I have written behind at my various employers. My earliest tools, from my BEng and MSEE, are quite embarrassing.
Nevertheless, one of my earlier tools, which I used for fixed-width ISAs like Gould and MIPS, is

TBD: I supervised a student project that did this for x86's variable-length instruction encodings.

The RISC-V toolchain referenced above emits LaTeX and decoders. The LaTeX "tables" are essentially lists of encodings. By "pretty" I mean the sort of table that looks like a hierarchy of Karnaugh maps, as is traditional. Other tools also generate nice diagrams of per-instruction encodings and fields.

TBD: somewhere I have email describing how to generate the pretty opcode maps. TBD: post that email - and then find a weekend to code it up again.
