This document describes the structure of the McSema codebase, where to find things, and how the various parts of the McSema toolchain fit together.
There are three high-level steps to using McSema:
- Disassembling a program binary and producing a CFG file
- Lifting the CFG file into LLVM bitcode
- Compiling the LLVM bitcode into a runnable binary
First, let's familiarize ourselves with the essentials of the file layout of McSema.

    ┌── mcsema
    │   ├── Arch
    │   │   ├── ...                 Architecture-neutral files
    │   │   └── X86
    │   │       ├── ...             X86-specific files
    │   │       ├── Runtime
    │   │       │   ├── ...
    │   │       │   └── State.h     X86 `RegState` structure
    │   │       └── Semantics
    │   │           ├── ADD.cpp     Semantics for ADD instruction
    │   │           └── ...         Other semantics code
    │   │
    │   ├── BC
    │   │   ├── Lift.cpp            Bitcode lifting code
    │   │   └── Util.cpp            Bitcode generation utilities
    │   │
    │   ├── CFG
    │   │   ├── CFG.cpp             CFG file deserialization code
    │   │   └── CFG.proto           CFG file format description
    │   │
    │   ├── cfgToLLVM               Legacy translation routines
    │   │   └── ...
    │   │
    │   └── Lift.cpp                Entrypoint of `mcsema-lift`
    │
    ├── tools
    │   └── mcsema_disass
    │       ├── ida
    │       │   └── get_cfg.py      IDA script to produce CFG files
    │       └── __main__.py         Entrypoint of `mcsema-disass`
    │
    └── third_party
        └── llvm                    LLVM source code

The first step to using McSema is to disassemble a program binary and produce a CFG file. The program that disassembles binaries is `mcsema-disass`.

`mcsema-disass` is organized into a frontend and a backend. The frontend command accepts a `--disassembler` command-line argument that tells it what disassembly engine to use. In practice, this will always be a path to IDA Pro.

The frontend is responsible for invoking the backend and the disassembly engine. The IDA Pro backend is an IDA Python script invoked by `idal` or `idal64`, and will output a CFG file.

The most important high-level structures recorded in the CFG file are:

- `Function`: functions in the binary with concrete implementations. The `Function` message contains all basic blocks and instructions of the function. A common example of this would be a program's `main` function.
- `ExternalFunction`: functions called but not defined by the program. A common example of this would be libc functions like `malloc`, `strlen`, etc.
- `Data`: data stored in the program binary. This includes things like global variables and variables with `static` storage duration in C/C++ code.
- `ExternalData`: data referenced but not defined by the program. An example of this would be the `optind` variable used by the C library's `getopt`. You can think of these as being like `extern`-declared global variables.

`mcsema-lift` has different ways of turning each of the above structures into LLVM bitcode.

The `mcsema-lift` command is used to lift CFG files to LLVM bitcode. Two important arguments to `mcsema-lift` are:

- `--os`: The operating system of the code being lifted. In practice, each binary format is specific to an operating system. ELF files are for Linux, Mach-O files for macOS, and PE files (such as DLLs) for Windows.
- `--arch`: The architecture of the code being lifted. This is one of `x86` or `amd64`.

Both of the above arguments instruct the lifter on how to configure the bitcode file.
McSema initializes itself before any bitcode is produced. The first initialization step is `InitArch`. This function uses the values passed to the `--os` and `--arch` command-line flags to set up a target triple and data layout for the bitcode file. The triple and data layout tell LLVM about things like the size of pointers and calling conventions.

`InitArch` also initializes things like the instruction disassembler and [dispatcher](/mcsema/Arch/X86/Dispatcher.cpp). McSema uses LLVM's built-in instruction disassembler. The disassembler converts bytes of machine code into `MCInst` objects. Each `MCInst` is labelled with an "op code." McSema has a function for lifting each op code. An instruction dispatcher is used to map an instruction's op code to a function that produces bitcode.
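
The dispatcher idea can be illustrated with a small, self-contained sketch. This is not McSema's actual dispatcher: the `Opcode` and `Lifter` types, the opcode numbers, and the `Dispatcher` class are all hypothetical stand-ins, and the lifters return a string where the real ones would emit bitcode. It only shows the table lookup from op code to lifting function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins: real lifters take LLVM and McSema types.
using Opcode = std::uint32_t;
using Lifter = std::string (*)(Opcode);

// Illustrative opcode numbers only; LLVM assigns the real values.
constexpr Opcode kAdd32ri = 1;
constexpr Opcode kAdd8mi = 2;

std::string LiftAdd32ri(Opcode) { return "lifted ADD32ri"; }
std::string LiftAdd8mi(Opcode) { return "lifted ADD8mi"; }

// Maps an instruction's op code to the function that lifts it.
class Dispatcher {
 public:
  void Register(Opcode op, Lifter fn) {
    assert(fn != nullptr);
    table_[op] = fn;
  }

  // Returns nullptr when no lifter is registered for the op code.
  Lifter Lookup(Opcode op) const {
    auto it = table_.find(op);
    return it == table_.end() ? nullptr : it->second;
  }

 private:
  std::unordered_map<Opcode, Lifter> table_;
};

// Builds the table once, during initialization.
Dispatcher MakeDispatcher() {
  Dispatcher d;
  d.Register(kAdd32ri, LiftAdd32ri);
  d.Register(kAdd8mi, LiftAdd8mi);
  return d;
}
```

A caller looks up the op code of a decoded instruction and, if the result is not null, invokes the returned function. An op code with no entry is an instruction the lifter does not support.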
Architecture-specific functionality is isolated into the `Arch` directory and its sub-directories. Functions that wrap architecture-specific behavior are prefixed with `Arch`. For example, `ArchRegisterName` is a function that returns the name of a register. This function dispatches to `X86RegisterName` when the value passed to the `--arch` command-line option is `x86` or `amd64`.
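
A minimal sketch of this dispatch pattern follows. The signatures, the `ArchKind` enum, and the register numbering are assumptions made for illustration. They are not taken from the McSema sources, and the sketch ignores 64-bit register names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the value of the --arch flag.
enum class ArchKind { X86, AMD64 };

// Hypothetical register numbering, for illustration only.
std::string X86RegisterName(unsigned reg) {
  static const char *const kNames[] = {"EAX", "ECX", "EDX", "EBX"};
  assert(reg < sizeof(kNames) / sizeof(kNames[0]));
  return kNames[reg];
}

// Architecture-neutral entry point: callers never name X86 directly.
std::string ArchRegisterName(ArchKind arch, unsigned reg) {
  switch (arch) {
    case ArchKind::X86:
    case ArchKind::AMD64:
      return X86RegisterName(reg);
  }
  throw std::invalid_argument("unsupported architecture");
}
```

Keeping callers on the `Arch`-prefixed function means that supporting another architecture only requires a new case in the switch, not changes at every call site.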
McSema decodes the CFG file (passed to `--cfg`) after all architecture- and OS-specific initialization is performed. The `ReadProtoBuf` function reads the contents of the CFG file produced by `mcsema-disass`, and converts the various CFG components into in-memory data structures.

There are four steps involved:

- `DeserializeExternFunc`: `ExternalFunction` messages from the CFG file are decoded into `ExternalCodeRef` data structures.

  External functions cannot be modeled like translated functions, and the control flow recovery tool needs to know the calling convention and argument count of these external functions. The calling convention and argument count are specified by an external function map file. There is a default external function map for both Linux and Windows in `tests/std_defs.txt`.

- `DeserializeNativeFunc`: `Function` messages from the CFG file are decoded into `NativeFunc` data structures. Each one of these functions will be lifted into bitcode.

  The `Function` message contains one or more `Block` messages. These represent basic blocks of machine code. `Block` messages are decoded by `DeserializeBlock` into `NativeBlock` objects. Each one of these objects will produce one or more `llvm::BasicBlock` objects.

  Each `Instruction` message contained in the `Block` is decoded by `DeserializeInst` into a `NativeInst` object. The `NativeInst` object is produced by decoding the raw bytes of the instruction using `DecodeInst`. `DecodeInst` uses the architecture-neutral `ArchDecodeInstruction` function to decode the instruction bytes into an `llvm::MCInst` object.

  The `NativeInst` class augments `llvm::MCInst` with data not needed by LLVM itself. For instance, `NativeInst` records instruction prefixes, whether the instruction is the last in a block, whether any others point to it, whether it references external data, etc. All of the McSema code operates on `NativeInst` instances, not on `llvm::MCInst`.

- `DeserializeData`: `Data` messages from the CFG file are decoded into `DataSectionEntry` objects. Each one of these objects will produce the equivalent of global variables.

  McSema does not always know the content or structure of the data sections within a binary. As such, it needs to preserve the content of those sections (almost) verbatim, treating them as mostly opaque blobs.

  Data sections are translated to packed LLVM structures. Representing data as a packed structure lets us reference individual data items and insert references to other code and data sections that will be correctly relocated in the bitcode.

  There are three ways of handling `DataSectionEntry` items. The item can be a data blob. If so, then it is added as a structure member. The item can be a function reference. If so, then a reference to the function is looked up in the module, and if found, is added as a structure member. Lastly, the item can be a data reference. If so, then a reference to the data section of the target is found in the module, and an offset from module start to item start is added to the data section base. This opaque address computation is crude but necessary, because a data section may reference another data section which is not yet populated.

- `DeserializeExternData`: `ExternalData` messages from the CFG file are decoded into `ExternalDataRef` objects. Each one of these objects is treated by the bitcode as an externally-defined global variable.

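
The three ways of handling `DataSectionEntry` items described above can be sketched with a simplified model. Everything here is hypothetical: the `Entry` and `Section` types, the field names, and the textual member descriptions are illustrative assumptions, not McSema's data structures. The sketch also skips the module lookup for function references and computes data-reference offsets from the base of the target section.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified model of one item in a data section.
enum class EntryKind { Blob, FunctionRef, DataRef };

struct Entry {
  EntryKind kind;
  std::vector<std::uint8_t> bytes;  // Blob: raw contents
  std::string function;             // FunctionRef: name of the target
  std::uint64_t target;             // DataRef: address of the target
};

struct Section {
  std::string name;
  std::uint64_t base;  // address of the first byte
  std::uint64_t size;  // length in bytes
};

// Returns the section containing `addr`, or nullptr if there is none.
const Section *FindSection(const std::vector<Section> &sections,
                           std::uint64_t addr) {
  for (const Section &s : sections) {
    if (addr >= s.base && addr - s.base < s.size) return &s;
  }
  return nullptr;
}

// Describes, as text, the member each entry adds to the packed structure.
std::vector<std::string> DescribeMembers(
    const std::vector<Entry> &entries,
    const std::vector<Section> &sections) {
  std::vector<std::string> members;
  for (const Entry &e : entries) {
    switch (e.kind) {
      case EntryKind::Blob:
        // Opaque bytes are copied in as a byte array.
        members.push_back("[" + std::to_string(e.bytes.size()) + " x i8]");
        break;
      case EntryKind::FunctionRef:
        // A reference to a lifted function.
        members.push_back("@" + e.function);
        break;
      case EntryKind::DataRef: {
        // Section base plus offset, so the target need not be filled yet.
        const Section *s = FindSection(sections, e.target);
        assert(s != nullptr && "data reference must land in a section");
        members.push_back("@" + s->name + " + " +
                          std::to_string(e.target - s->base));
        break;
      }
    }
  }
  return members;
}
```

Expressing a data reference as base plus offset is what lets one section point into another before the other section's contents have been emitted.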
The `LiftCodeIntoModule` function does the bulk of the lifting work. The function is mostly self-documenting:

    bool LiftCodeIntoModule(NativeModulePtr natMod, llvm::Module *M) {
      InitLiftedFunctions(natMod, M);
      InitExternalData(natMod, M);
      InitExternalCode(natMod, M);
      InsertDataSections(natMod, M);
      return LiftFunctionsIntoModule(natMod, M);
    }

`InitLiftedFunctions` creates one `llvm::Function` for every `NativeFunction` data structure.

`InitExternalData` creates global variables for every `ExternalDataRef` object.

`InitExternalCode` creates external `llvm::Function` declarations for every `ExternalCodeRef`. This involves declaring the functions with the correct prototypes that include the OS-specific calling convention, and argument and return types.

`InsertDataSections` creates the global packed structs for each `DataSectionEntry` item. The translation happens via two nested loops. The first loop iterates over every data section in the CFG. The second loop, found in `dataSectionToTypesContents`, iterates over every item in the data section and fills their content into the bitcode file.

`LiftFunctionsIntoModule` lifts the actual instructions into the `llvm::Function`s created by `InitLiftedFunctions`. The first step to lifting each function is `InsertFunctionIntoModule`. This function starts by creating one `llvm::BasicBlock` for each of the function's `NativeBlock`s. A special entry basic block is added to the `llvm::Function`. This block creates one variable for every machine code register. The creation of the local references into the register state is done by `ArchAllocRegisterVars`.

The `LiftBlockIntoFunction` function then populates the empty `llvm::BasicBlock`s with bitcode emulating the machine code. It executes `LiftInstIntoBlock` for every `NativeInst` in the `NativeBlock` object. `LiftInstIntoBlock` discovers the architecture-specific instruction lifter using the `ArchGetInstructionLifter` function, which looks up the opcode of the `llvm::MCInst` in the instruction dispatcher.

The `ArchLiftInstruction` function invokes the instruction-specific lifter function. It may do some architecture-specific pre-processing before lifting the instruction.

This section will briefly cover raw instruction translation. For more details on the translation functions, see the ADDING AN INSTRUCTION document.
The LLVM disassembler produces its own opcodes, with each operand combination having its own opcode. For instance, the x86 `ADD` instruction has at least 31 different LLVM opcodes, with names like `ADD32ri` (add a 32-bit immediate to a 32-bit register), `ADD8mi` (add an 8-bit immediate to an 8-bit memory location), etc.

All of these opcodes will have very similar translations, differing only in operand order and memory width. To simplify translation, the core of the instruction is usually a function templated on width that operates on two `llvm::Value` pairs that act as operands. For the `ADD` instruction, this is `doAddVV`. Other helper functions exist to convert immediate values and memory addresses to `llvm::Value` objects and to write the result of the addition to the correct destination (e.g. memory or register). Examples of these helper functions are `doAddRI`, `doAddMI`, etc.
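
The layering of a width-templated core under thin operand-specific helpers can be sketched on plain integers. The names `addVV`, `addRI`, and `addMI` mirror the roles of `doAddVV`, `doAddRI`, and `doAddMI` but are hypothetical. The real functions build `llvm::Value` objects and update flags, neither of which this sketch does, and the `State` struct is an invented stand-in for registers and memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical machine state standing in for lifted registers and memory.
struct State {
  std::uint64_t regs[4];
  std::uint8_t mem[64];
};

// Core operation shared by every variant: value plus value,
// truncated to `width` bits. Flag updates are omitted.
template <int width>
std::uint64_t addVV(std::uint64_t lhs, std::uint64_t rhs) {
  static_assert(width == 8 || width == 16 || width == 32 || width == 64,
                "unsupported operand width");
  return (lhs + rhs) & (~std::uint64_t{0} >> (64 - width));
}

// Register + immediate: fetch the operands, then defer to the core.
template <int width>
void addRI(State &s, unsigned dst, unsigned src, std::uint64_t imm) {
  s.regs[dst] = addVV<width>(s.regs[src], imm);
}

// Memory + immediate: little-endian load, core add, store back.
template <int width>
void addMI(State &s, unsigned addr, std::uint64_t imm) {
  constexpr unsigned kBytes = width / 8;
  assert(addr + kBytes <= sizeof(s.mem));
  std::uint64_t value = 0;
  for (unsigned i = 0; i < kBytes; ++i) {
    value |= std::uint64_t{s.mem[addr + i]} << (8 * i);
  }
  value = addVV<width>(value, imm);
  for (unsigned i = 0; i < kBytes; ++i) {
    s.mem[addr + i] = static_cast<std::uint8_t>(value >> (8 * i));
  }
}
```

Only the helpers know where operands come from and where the result goes, so each additional opcode variant costs a few lines rather than a new copy of the arithmetic.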
All the translation functions must have the same prototype and share lots of boilerplate code. To make writing them easier, there are several helper macros defined in `mcsema/BC/Util.h`:

- `GENERIC_TRANSLATION(NAME, THECALL)`: Create a function named `translate_<NAME>` that executes the statement `THECALL`.
- `GENERIC_TRANSLATION_MEM(NAME, THECALL, GLOBALCALL)`: Like `GENERIC_TRANSLATION`, but checks if the instruction references code or data. If so, execute `GLOBALCALL` instead of `THECALL`.
- `GENERIC_TRANSLATION_32MI(NAME, THECALL, GLOBALCALL, GLOBALIMMCALL)`: Used only for instructions that have two operands: a 32-bit immediate and a memory value. Like `GENERIC_TRANSLATION_MEM`, but checks which operand references code or data. If it's the immediate, execute `GLOBALIMMCALL`.
- `OP(x)`: Shorthand for `inst.getOperand(x)`.
- `ADDR(x)`: Shorthand for `getAddrFromExpr` with common arguments.
- `ADDR_NOREF(x)`: Shorthand for `getAddrFromExpr` where it is certain the function will never reference a data variable, but needs to compute a complex address expression.

Many x86 instructions require complex address computation due to complex addressing modes. The following helper functions are defined in `cfgToLLVM/x86Helpers.cpp` and are used to do address computation:

- `getAddrFromExpr`: Computes a `Value` from a complex address expression such as `[0x123456+EAX*4]`. If the expression references global data, use that in the computation instead of assuming values are opaque immediates.
- `GLOBAL`: Shorthand for `getAddrFromExpr`.
- `GLOBAL_DATA_OFFSET`: Used when it is certain that the instruction must reference code/data, and not an opaque immediate.

Using these macros, it is then possible to define a translation function. For instance, `ADD32ri` is defined as:

    GENERIC_TRANSLATION(ADD32ri, doAddRI<32>(ip, block, OP(0), OP(1), OP(2)))

That code will define a function named `translate_ADD32ri`, and call `doAddRI<32>(ip, block, OP(0), OP(1), OP(2))` to do the translation. The result will be stored in operand 0, and the two addends are operand 1 and operand 2.
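
To make the macro mechanics concrete, here is a simplified analogue that compiles on its own. `SKETCH_TRANSLATION`, `SketchInst`, and `sketchAddRI` are hypothetical names invented for this sketch, and the generated function's signature is an assumption. The real `GENERIC_TRANSLATION` in `mcsema/BC/Util.h` generates a function with McSema's own prototype. Only the `OP(x)` expansion to `inst.getOperand(x)` follows the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decoded instruction and its operands.
struct SketchInst {
  std::vector<std::uint64_t> operands;
  std::uint64_t getOperand(unsigned i) const {
    assert(i < operands.size());
    return operands[i];
  }
};

// Simplified analogue of GENERIC_TRANSLATION: defines translate_<NAME>,
// whose body is the single expression THECALL.
#define SKETCH_TRANSLATION(NAME, THECALL)                    \
  std::uint64_t translate_##NAME(const SketchInst &inst) {   \
    return THECALL;                                          \
  }

// Relies on a variable named `inst` being in scope, which the
// function generated by the macro above provides.
#define OP(x) inst.getOperand(x)

// Hypothetical helper: adds an immediate to a register value,
// truncating the sum to `width` bits.
template <int width>
std::uint64_t sketchAddRI(std::uint64_t reg, std::uint64_t imm) {
  return (reg + imm) & (~std::uint64_t{0} >> (64 - width));
}

// Expands to: std::uint64_t translate_ADD32ri(const SketchInst &inst)
SKETCH_TRANSLATION(ADD32ri, sketchAddRI<32>(OP(1), OP(2)))
```

Because `OP(x)` names `inst` without declaring it, the shorthand only works inside a function produced by the translation macro. This is the design trade-off of the approach: less boilerplate per opcode in exchange for an implicit naming contract.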