The Hexagon CPU design presents several unique challenges to a reverse engineer:
- Scale: with scalar and vector extensions, Hexagon has more than 2000 distinct instructions.
- Multi-threading: with four execution slots, this CPU is inherently multi-threaded. Instructions are grouped in packets, where each packet has up to four instructions that run in parallel.
- Data dependencies: instructions in a given packet can reference data produced by other instructions in the same packet. These ".new" register semantics are unique to this VLIW (very long instruction word) CPU.
- Branch semantics: a packet can have up to two branch instructions. There are many branch types: direct vs. indirect, conditional vs. unconditional, and jump vs. call. Modeling these branch semantics is challenging: only a single branch may be taken at the end of packet processing, subject to some ordering rules.
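To make the packet and ".new" semantics concrete, here is a minimal Python sketch of a toy packet executor. It is not the plugin's code: the `Insn` tuple, `run_packet`, and the two operations are hypothetical. It assumes that a plain operand reads the value a register held before the packet started, that a `.new` operand reads the value produced inside the same packet, and that all writes become visible together at the end of the packet.

```python
from collections import namedtuple

# Hypothetical toy instruction: dest = op(src_a, src_b).
# A source name ending in ".new" reads the value produced in this packet.
Insn = namedtuple("Insn", ["dest", "op", "src_a", "src_b"])

OPS = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "xor": lambda a, b: a ^ b,
}


def run_packet(regs, packet):
    """Execute one packet and return a new, updated register file."""
    if len(packet) > 4:
        raise ValueError("a packet holds at most four instructions")

    new_values = {}  # results produced inside this packet
    pending = list(packet)

    def ready(src):
        # A ".new" operand is ready once its producer has been evaluated.
        return not src.endswith(".new") or src[:-4] in new_values

    def read(src):
        if src.endswith(".new"):
            return new_values[src[:-4]]
        return regs[src]  # value from before the packet started

    while pending:
        runnable = [i for i in pending if ready(i.src_a) and ready(i.src_b)]
        if not runnable:
            raise ValueError("unresolvable .new dependency")
        for insn in runnable:
            new_values[insn.dest] = OPS[insn.op](read(insn.src_a), read(insn.src_b))
            pending.remove(insn)

    # All writes commit together at the end of the packet.
    committed = dict(regs)
    committed.update(new_values)
    return committed
```

In this model, an instruction that reads `R0` in the same packet as an instruction that writes `R0` still sees the old value, while one that reads `R0.new` sees the freshly produced one, regardless of where it sits in the packet.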
This plugin tackles the complexity described above through automatic code generation. Its build-time components parse instruction descriptors and automatically generate the LLIL (Low Level IL) lifting code. Its runtime components track packet-level data and implement the ".new" and branch semantics.
(Architecture diagram: read from top to bottom, left to right.)
- Instruction definitions: a dataset that describes instruction encoding, behavior and semantics. It is part of QEMU's Hexagon target code base, located at third_party/qemu-hexagon/. For example, alu.idef has the following description for the A2_add instruction:
Q6INSN(A2_add,"Rd32=add(Rs32,Rt32)",ATTRIBS(),
"Add 32-bit registers",
{ RdV=RsV+RtV;})
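A descriptor of this shape can be picked apart with a few lines of Python. This is a minimal sketch, assuming the five-field `Q6INSN(tag, "syntax", ATTRIBS(...), "description", { semantics })` layout shown above and a single definition per input string. The regular expression and `parse_q6insn` are illustrative only; the actual scripts may process the definitions differently (for example, by expanding the macro with the C preprocessor).

```python
import re

# Hypothetical pattern for the Q6INSN layout shown above.
Q6INSN_RE = re.compile(
    r'Q6INSN\(\s*(?P<tag>\w+)\s*,'            # instruction tag
    r'\s*"(?P<syntax>[^"]*)"\s*,'             # assembly syntax
    r'\s*ATTRIBS\((?P<attribs>[^)]*)\)\s*,'   # attribute list
    r'\s*"(?P<descr>[^"]*)"\s*,'              # human-readable description
    r'\s*(?P<semantics>\{.*\})\s*\)',         # C-like semantics block
    re.DOTALL,
)


def parse_q6insn(text):
    """Split one Q6INSN(...) definition into its named fields."""
    match = Q6INSN_RE.search(text)
    if match is None:
        raise ValueError("not a Q6INSN definition")
    fields = match.groupdict()
    # Turn "A, B" into ["A", "B"]; an empty ATTRIBS() yields [].
    fields["attribs"] = [a.strip() for a in fields["attribs"].split(",") if a.strip()]
    return fields
```

Applied to the A2_add definition, this yields the tag `A2_add`, the syntax string `Rd32=add(Rs32,Rt32)`, and the semantics string `{ RdV=RsV+RtV;}`, which are the inputs that the two generators described below work from.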
- Instruction attributes: a data structure that holds instruction attributes and is available to C programs. These header files are generated by a set of scripts in /third_party/qemu_hexagon/ and consumed at runtime by the instruction decoder.
- Decoder: decodes a vector of 32-bit words into a sequence of Hexagon instructions, grouped in a single packet. The decoder fails safe when it cannot decode a given input. For each instruction, it fills in the instruction id (or tag) and the operand information (immediate values or registers).
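The packet-grouping step of decoding can be sketched in Python. The sketch assumes the "parse bits" convention from the Programmer's Reference Manual: bits 15:14 of each 32-bit word are 0b11 for the last word of a packet, 0b01 or 0b10 when more words follow, and 0b00 for a duplex word, which also ends the packet. `split_packets` is a hypothetical helper, not the plugin's decoder, and it does not decode opcodes or operands. Raising an error instead of guessing is one way to illustrate the fail-safe behavior.

```python
def parse_bits(word):
    """Return bits 15:14 of a 32-bit instruction word."""
    return (word >> 14) & 0b11


def split_packets(words):
    """Group instruction words into packets of at most four words.

    Returns a list of packets, each a list of words. Raises ValueError
    rather than guessing when no end-of-packet marker is found.
    """
    packets, current = [], []
    for word in words:
        current.append(word)
        if parse_bits(word) in (0b11, 0b00):  # end of packet, or duplex
            packets.append(current)
            current = []
        elif len(current) == 4:
            raise ValueError("no end-of-packet marker within four words")
    if current:
        raise ValueError("input ends in the middle of a packet")
    return packets
```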
- Instruction Text Tokens Generator: gen_insn_text_funcs.py parses instruction definitions and generates code that implements BN's GetInstructionText API for each instruction. It parses the behavior descriptor using a grammar, then transforms the resulting tree into a sequence of BinaryNinja::InstructionTextToken objects. For example, A2_add has the descriptor "Rd32=add(Rs32,Rt32)", which is parsed into the following tree:
assign_to_op
  reg
    Rd32
  ...
  call_exp
    ...
    call2
      insn add
      reg
        Rs32
      reg
        Rt32
and transformed into the following sequence of tokens:
void tokenize_A2_add(uint64_t pc, const Packet &pkt, const Insn &insn,
                     std::vector<InstructionTextToken> &result) {
  result.emplace_back(RegisterToken, StrCat("R", insn.regno[0]));
  result.emplace_back(TextToken, " = ");
  result.emplace_back(InstructionToken, "add");
  result.emplace_back(TextToken, "(");
  result.emplace_back(RegisterToken, StrCat("R", insn.regno[1]));
  result.emplace_back(TextToken, ",");
  result.emplace_back(RegisterToken, StrCat("R", insn.regno[2]));
  result.emplace_back(TextToken, ")");
}
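The mapping from a syntax string to a token sequence can be approximated in Python with a regex-based scanner. This is only a sketch: it replaces the grammar-based parse with a flat scan, `tokenize_syntax` is a hypothetical name, and it assumes that register operands look like `R<letter>32` and map to successive `insn.regno` indices. The token type names are the ones used in the generated C++ above.

```python
import re

# Order matters: register operands are tried before plain identifiers.
TOKEN_RE = re.compile(r"(?P<reg>R[a-z]32)|(?P<name>[A-Za-z_]\w*)|(?P<text>\S)")


def tokenize_syntax(syntax):
    """Turn e.g. 'Rd32=add(Rs32,Rt32)' into (token_type, text) pairs."""
    tokens = []
    regno = 0
    for match in TOKEN_RE.finditer(syntax):
        if match.group("reg"):
            # Placeholder for StrCat("R", insn.regno[i]) in the generated C++.
            tokens.append(("RegisterToken", f"R<regno[{regno}]>"))
            regno += 1
        elif match.group("name"):
            tokens.append(("InstructionToken", match.group("name")))
        else:
            text = match.group("text")
            tokens.append(("TextToken", " = " if text == "=" else text))
    return tokens
```

For "Rd32=add(Rs32,Rt32)" this produces eight tokens in the same order as the `emplace_back` calls in `tokenize_A2_add`. A real generator needs the grammar to handle predicates, immediates, and the many other operand forms.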
- Instruction Utils: this module implements BN's GetInstructionText API by calling the generated instruction tokenizers. It also implements BN's GetInstructionInfo API: it analyzes the decoder's output and reports the packet's branch targets.
- Packet Database: maps binary addresses to instruction packets. Binary Ninja works at the level of a single instruction; however, properly modeling an instruction requires knowledge of the other instructions in its packet.
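A lookup structure of this kind can be sketched in Python. The sketch assumes that every instruction occupies 4 bytes and that packets are registered by their start address. `PacketDb`, `add`, and `lookup` are hypothetical names; the plugin's own packet_db is written in C++ and may be organized differently.

```python
import bisect


class PacketDb:
    """Maps any instruction address to the packet that contains it."""

    INSN_SIZE = 4  # assumption: every instruction word is 4 bytes

    def __init__(self):
        self._starts = []   # sorted packet start addresses
        self._packets = {}  # start address -> list of instructions

    def add(self, start, insns):
        if start not in self._packets:
            bisect.insort(self._starts, start)
        self._packets[start] = list(insns)

    def lookup(self, addr):
        """Return (start, insns, index) for addr, or None if unknown."""
        # Find the last packet that starts at or before addr.
        pos = bisect.bisect_right(self._starts, addr) - 1
        if pos < 0:
            return None
        start = self._starts[pos]
        insns = self._packets[start]
        offset = addr - start
        if offset % self.INSN_SIZE or offset >= len(insns) * self.INSN_SIZE:
            return None  # misaligned, or past the end of that packet
        return start, insns, offset // self.INSN_SIZE
```

With this, a request for the instruction at a given address also yields its packet and its position inside that packet, which is the information needed to resolve ".new" operands and packet-level branches.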
- Instruction IL Generator: gen_il_funcs.py parses instruction definitions and generates code that implements BN's GetInstructionLowLevelIL API for each (supported) instruction. It parses the semantics descriptor using a C-like grammar, then transforms the resulting tree into a sequence of operations on a BinaryNinja::LowLevelILFunction object. This builds an equivalent symbolic model and effectively lifts the instruction. For example, A2_add has the descriptor "{ RdV=RsV+RtV;}". This C code is parsed into the following tree:
multi_stmt
  assg_stmt
    assg
      RdV
      expr_binop
        RsV
        +
        RtV
and transformed into the following sequence of operations:
void lift_A2_add(Architecture *arch, uint64_t pc, const Packet &pkt,
                 const Insn &insn, int insn_num, PacketContext &ctx) {
  LowLevelILFunction &il = ctx.IL();
  const int RdV = ctx.AddDestWriteOnlyReg(MapRegNum('R', insn.regno[0]));
  const int RsV = MapRegNum('R', insn.regno[1]);
  const int RtV = MapRegNum('R', insn.regno[2]);
  il.AddInstruction(il.SetRegister(
      4, RdV, il.Add(4, il.Register(4, RsV), il.Register(4, RtV))));
}
A more involved example is the A2_pxort instruction. It has the following semantics:
SEMANTICS( \
    "A2_pxort", \
    "if (Pu4) ""Rd32=xor(Rs32,Rt32)", \
    """{if(fLSBOLD(PuV)){RdV=RsV^RtV;} else {CANCEL;}}""" \
)
This is modeled using IL "if" statements:
void lift_A2_pxort(Architecture *arch, uint64_t pc, const Packet &pkt,
                   const Insn &insn, int insn_num, PacketContext &ctx) {
  LowLevelILFunction &il = ctx.IL();
  const int PuV = MapRegNum('P', insn.regno[0]);
  const int RdV = ctx.AddDestWriteOnlyReg(MapRegNum('R', insn.regno[1]));
  const int RsV = MapRegNum('R', insn.regno[2]);
  const int RtV = MapRegNum('R', insn.regno[3]);
  {
    LowLevelILLabel true_case, done;
    il.AddInstruction(il.If(il.Register(4, PuV), true_case, done));
    il.MarkLabel(true_case);
    il.AddInstruction(il.SetRegister(
        4, RdV, il.Xor(4, il.Register(4, RsV), il.Register(4, RtV))));
    il.MarkLabel(done);
  }
}
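The core generation step, turning a semantics string into lifter source text, can be sketched in Python for the simplest case. This is an illustration under strong assumptions: it uses a regular expression instead of a C-like grammar, handles only a single `dst = lhs <op> rhs` assignment, and `gen_lift_body` is a hypothetical name. The emitted `il.*` calls are limited to the ones that appear in the generated code above.

```python
import re

# Only the operators that appear in the examples above.
BINOPS = {"+": "Add", "^": "Xor"}

ASSIGN_RE = re.compile(r"^\{\s*(\w+)\s*=\s*(\w+)\s*([+^])\s*(\w+)\s*;\s*\}$")


def gen_lift_body(semantics, size=4):
    """Emit C++ lifter text for a 'dst = lhs <op> rhs' semantics string."""
    match = ASSIGN_RE.match(semantics.strip())
    if match is None:
        # Anything beyond a single binary assignment needs a real parser.
        raise NotImplementedError(f"unsupported semantics: {semantics!r}")
    dst, lhs, op, rhs = match.groups()
    expr = (f"il.{BINOPS[op]}({size}, il.Register({size}, {lhs}), "
            f"il.Register({size}, {rhs}))")
    return f"il.AddInstruction(il.SetRegister({size}, {dst}, {expr}));"
```

Calling `gen_lift_body("{ RdV=RsV+RtV;}")` returns the same `il.AddInstruction(...)` statement that appears in `lift_A2_add`, apart from line breaks. The predicated A2_pxort semantics are rejected by this sketch, which shows why a tree-based translation is needed for conditionals such as the IL "if" above.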
- IL utils: this module implements BN's GetInstructionLowLevelIL API by calling the generated instruction lifters. It lifts all instructions in a packet and models the packet's branch semantics.
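The "single branch per packet" rule can be illustrated with a small Python sketch. It assumes that when more than one branch in a packet is taken, the first one in packet order wins, and that control transfers only after the whole packet has executed. That ordering rule is an assumption made for the example, and `resolve_packet_branch` is a hypothetical helper rather than part of the plugin.

```python
def resolve_packet_branch(branches, fallthrough):
    """Pick the single control-flow target for a packet.

    branches: list of (taken, target) pairs in packet order, at most two.
    fallthrough: address of the next packet.
    """
    if len(branches) > 2:
        raise ValueError("a packet has at most two branch instructions")
    for taken, target in branches:
        if taken:
            return target  # assumption: first taken branch wins
    return fallthrough
```

A lifter that follows this model would first lift every instruction in the packet, recording each branch condition and target, and only then emit the control-flow IL for the packet as a whole.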
- Packet Context: an auxiliary object that tracks all registers clobbered in a packet. It is used by the IL utils module.
- Plugin: the program's entry point. It implements and registers the new 'Hexagon' architecture module. The architecture module stores decoded instructions in packet_db, calls out to insn_util to disassemble instructions, and to il_util to lift packets.
- QEMU's Hexagon target by Taylor Simpson from Qualcomm Innovation Center.
- Qualcomm Hexagon V67 Programmer’s Reference Manual, 80-N2040-45 Rev. B, February 25, 2020. Can be downloaded from Hexagon SDK website.
- Binary Ninja API and documentation.
- Official BN architecture plugins: arch-x86, arch-arm64, arch-mips.