update typing section
mahaloz committed Apr 28, 2024
1 parent 6f4b8c2 commit ece51b0
Showing 8 changed files with 126 additions and 19 deletions.
22 changes: 14 additions & 8 deletions docs/fundamentals/cfg_recovery/lifting.md
@@ -1,14 +1,17 @@
# Program Lifting

## Introduction
In program lifting, disassembly is converted into an intermediate language (IL)[^1], which is also referred to as an intermediate representation (IR).
Converting disassembly to an IL allows decompiler developers to make optimizations on the IL level which apply to multiple architectures.

There exist many ILs for analysis of programs, but some of the most notable for decompilation are tied to binary analysis.
There exist many ILs for analysis of programs, but some of the most notable ones for decompilation are tied to binary analysis.
Binary analysis platforms that have created or used an IL mostly follow the same techniques and differ mainly in how they use the IL afterwards[^1][^2][^3][^4].
One such use is recompilation of decompilation, which can be made easier by lifting to compiled ILs like LLVM-IR[^5][^6].
One such use is recompilable decompilation, which can be made easier by lifting to compiled ILs like LLVM-IR[^5][^6].

Similar to static analysis, most ILs used in decompilation support some form of static single assignment (SSA) since it simplifies some analyses[^7].

## Example Lifted Program
Below is some example x86 assembly:
Below is some example x86 assembly of a simple C program:
```asm
0000000000001129 <main>:
1129: f3 0f 1e fa endbr64
@@ -22,10 +25,10 @@ Below is some example x86 assembly:
1143: c7 45 fc 01 00 00 00 mov DWORD PTR [rbp-0x4],0x1
114a: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
114d: 5d pop rbp
114e: c3
114e: c3 ret
```

As an example, it can be lifted to an IL like VEX, the IL used in the angr decompiler:
It can be lifted to an IL like VEX, the IL used in the angr decompiler:
```asm
00 | ------ IMark(0x401129, 4, 0) ------
01 | PUT(rip) = 0x000000000040112d
@@ -67,12 +70,15 @@ NEXT: PUT(rip) = 0x000000000040113a; Ijk_Boring
...
```

Notice the abstraction of compares and assignments makes the code much more verbose.
As such, it has been cut for brevity.
In the case of VEX, each `t` variable is only ever assigned once, making this SSA form.
You will also notice how verbose every instruction has become.
Even simple `mov` instructions, which assign a value to a register, have much more information now.
The lifted VEX above has been cut for brevity.
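
The single-assignment property of the `t` variables can be checked mechanically. The sketch below is a minimal illustration in C, assuming a hypothetical, simplified statement form in which each statement writes at most one temporary. The `Stmt` type, `NO_TMP`, `MAX_TMPS`, and `is_single_assignment` are made up for this example and are not part of VEX or angr.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified IL statement: it writes temporary t<dst>,
 * or no temporary at all when dst is NO_TMP (e.g. a PUT or a store). */
#define NO_TMP (-1)
#define MAX_TMPS 64

typedef struct {
    int dst; /* index of the temporary written, or NO_TMP */
} Stmt;

/* Returns true if no temporary is assigned more than once. */
bool is_single_assignment(const Stmt *stmts, size_t n) {
    bool seen[MAX_TMPS] = { false };
    for (size_t i = 0; i < n; i++) {
        int d = stmts[i].dst;
        if (d == NO_TMP) {
            continue; /* statement does not define a temporary */
        }
        if (d < 0 || d >= MAX_TMPS) {
            return false; /* out of range for this sketch */
        }
        if (seen[d]) {
            return false; /* second assignment to the same temporary */
        }
        seen[d] = true;
    }
    return true;
}
```

A real lifter guarantees this property by construction, since it allocates a fresh temporary for every intermediate value, so a check like this is mostly useful as a sanity test on hand-written or transformed IL.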

[^1]: Song, Dawn, et al. "BitBlaze: A new approach to computer security via binary analysis." Information Systems Security: 4th International Conference, ICISS 2008, Hyderabad, India, December 16-20, 2008. Proceedings 4. Springer Berlin Heidelberg, 2008.
[^2]: Kinder, Johannes, and Helmut Veith. "Jakstab: a static analysis platform for binaries: tool paper." Computer Aided Verification: 20th International Conference, CAV 2008 Princeton, NJ, USA, July 7-14, 2008 Proceedings 20. Springer Berlin Heidelberg, 2008.
[^3]: Brumley, David, et al. "BAP: A binary analysis platform." Computer Aided Verification: 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings 23. Springer Berlin Heidelberg, 2011.
[^4]: Wang, Fish, and Yan Shoshitaishvili. "Angr-the next generation of binary analysis." 2017 IEEE Cybersecurity Development (SecDev). IEEE, 2017.
[^5]: Gussoni, Andrea, et al. "A comb for decompiled c code." Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. 2020.
[^6]: https://github.com/revng/revng
[^6]: Revng. “Revng/Revng: Revng: The Core Repository of the Rev.Ng Project.” GitHub, github.com/revng/revng. Accessed 27 Apr. 2024.
[^7]: Van Emmerik, Michael James. Static single assignment for decompilation. University of Queensland, 2007.
8 changes: 4 additions & 4 deletions docs/fundamentals/cfg_recovery/overview.md
@@ -6,10 +6,10 @@ Since decompilation directly relies on this CFG, the recovery of it is considere

This recovery can be broken up into multiple phases, some being optional depending on the decompilation target:

1. [Disassembling](): the conversion of binary code to its mnemonic instructions and operands
2. [Program Lifting](): the conversion disassembly to an intermediate language (IL) for better abstraction
3. [Function Recognition](): the discovery of boundaries defining a function
4. [Indirect Jump Resolving](): the resolution of jumps that have no constant target(s).
1. [Disassembling](/docs/fundamentals/cfg_recovery/disassembly): the conversion of binary code to its mnemonic instructions and operands
2. [Program Lifting](/docs/fundamentals/cfg_recovery/lifting): the conversion of disassembly to an intermediate language (IL) for better abstraction
3. [Function Recognition](/docs/fundamentals/cfg_recovery/func_recov): the discovery of boundaries defining a function
4. [Indirect Jump Resolving](/docs/fundamentals/cfg_recovery/jump_res): the resolution of jumps that have no constant target(s).

The second phase, program lifting, is only required if the decompiler aims to be architecture agnostic.
Most decompilers will use an IL to make their later analyses more widely applicable.
103 changes: 103 additions & 0 deletions docs/fundamentals/type_recovery.md
@@ -0,0 +1,103 @@
# Type Recovery
## Introduction
Type recovery is the process of identifying high-level variables and their types from a binary[^1], usually in the form of a CFG.
This wiki groups variable identification with type recovery since they are often intertwined in decompilation[^2].

In most decompilers, the typing system requires a well-formed CFG, usually lifted, on which to perform analysis.
The analysis can be thought of as working in two refining stages:

1. Variable Identification: discover the initial variable locations and their boundaries.
2. Type Constraining: use each variable's uses in the code to build constraints on its size, then choose a type.

This process is often iterative between constraint building and variable identification.
Variable identification also includes retyping/resizing as more constraints are gathered.

![](/static/img/typing.svg)

## Typing Example
A simple C program is shown below:
```c
#include <stdio.h>

int main(int argc, char** argv) {
    char* str = argv[1];
    puts(str);
}
```
After compiling and disassembling:
```bash
gcc example.c -o example && objdump -D -M intel example | grep "<main>:" -A 12
```

We are left with the following:
```asm
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push rbp
114e: 48 89 e5 mov rbp,rsp
1151: 48 83 ec 20 sub rsp,0x20
1155: 89 7d ec mov DWORD PTR [rbp-0x14],edi
1158: 48 89 75 e0 mov QWORD PTR [rbp-0x20],rsi
115c: 48 8b 45 e0 mov rax,QWORD PTR [rbp-0x20]
1160: 48 8b 40 08 mov rax,QWORD PTR [rax+0x8]
1164: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
1168: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
116c: 48 89 c7 mov rdi,rax
116f: e8 dc fe ff ff call 1050 <puts@plt>
```

A naive variable recovery algorithm might produce the following:
```c
int main(int a1, long long a2) {
    int v1;        // rbp-0x14
    long long v2;  // rbp-0x20
    long long v3;  // rax
    long long v4;  // rbp-0x8
    v1 = a1;
    v2 = a2;
    v3 = *(long long *)(v2 + 8);  // mov rax,QWORD PTR [rax+0x8]
    v4 = v3;
    puts((char *) v4);
}
```
However, since `puts` is known to take a `char *` as its first argument, `v4` can be constrained to `char *`.
Back-propagating this type constraint to the earlier variables, we get the following:
```c
int main(int a1, char** a2) {
    int v1;     // rbp-0x14
    char** v2;  // rbp-0x20
    char* v3;   // rax
    char* v4;   // rbp-0x8
    v1 = a1;
    v2 = a2;
    v3 = v2[1];
    v4 = v3;
    puts(v4);
}
```
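
The back-propagation step can be sketched as a small fixpoint loop. This is a minimal illustration in C under strong assumptions: every type is `char` followed by some number of `*`, and the only statements are copies (`dst = src`) and loads (`dst = src[i]`). The names `Constraint`, `Kind`, `UNKNOWN`, and `propagate` are hypothetical and do not come from any particular decompiler.

```c
#include <stdbool.h>
#include <stddef.h>

#define UNKNOWN (-1)

typedef enum { COPY, LOAD } Kind;

/* dst = src (COPY) or dst = src[i] (LOAD). Variables are array indices. */
typedef struct {
    Kind kind;
    int dst;
    int src;
} Constraint;

/* depth[v] is the number of '*' after 'char' (1 = char *, 2 = char **),
 * or UNKNOWN. Known depths are pushed across constraints, in both
 * directions, until nothing changes. */
void propagate(int *depth, const Constraint *cs, size_t n) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            /* For a load, the source is one pointer level deeper than the result. */
            int delta = (cs[i].kind == LOAD) ? 1 : 0;
            int d = cs[i].dst;
            int s = cs[i].src;
            if (depth[d] != UNKNOWN && depth[s] == UNKNOWN) {
                depth[s] = depth[d] + delta; /* backward: result type fixes source */
                changed = true;
            } else if (depth[s] != UNKNOWN && depth[d] == UNKNOWN
                       && depth[s] - delta >= 0) {
                depth[d] = depth[s] - delta; /* forward: source type fixes result */
                changed = true;
            }
        }
    }
}
```

For the example above, numbering `a2`, `v2`, `v3`, `v4` as 0 to 3 and seeding only `v4` with depth 1 (from the `puts` argument) yields depth 1 for `v3` and depth 2 for `v2` and `a2`, matching `char *` and `char **`. The `int` variables `a1` and `v1` are outside what this sketch models.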

## Variable Identification
Variable identification seeks to map memory locations and registers to high-level variables in the targeted language output (usually C).
Early decompilers often mapped variables to locations based on simple accesses to those locations[^2].
Later work built on this by expanding the set of uses considered when identifying a variable's location[^1].
An identified variable often has several candidate sizes, since its type is not yet known.
These candidates are often reduced to a single type using type sinks[^3]: uses that carry an explicit type, such as being passed as an argument to a call.
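
As a rough illustration of location-based identification, the sketch below groups observed stack accesses by offset and treats each distinct offset as one candidate variable. It is a simplification written for this page, not an algorithm taken from the cited work: the `Access` type and `identify_stack_vars` are hypothetical names, and overlapping accesses at different offsets are not merged.

```c
#include <stddef.h>
#include <stdlib.h>

/* One observed stack access: offset from rbp and width in bytes. */
typedef struct {
    int offset;
    int size;
} Access;

/* qsort comparator: ascending by offset. */
static int by_offset(const void *a, const void *b) {
    const Access *x = a;
    const Access *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sorts accesses by offset and collapses those at the same offset into one
 * candidate variable, keeping the widest access as its size.
 * Returns the number of candidates, which are written to the front of acc. */
size_t identify_stack_vars(Access *acc, size_t n) {
    if (n == 0) {
        return 0;
    }
    qsort(acc, n, sizeof *acc, by_offset);
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (acc[i].offset == acc[out].offset) {
            if (acc[i].size > acc[out].size) {
                acc[out].size = acc[i].size; /* same slot, wider access */
            }
        } else {
            acc[++out] = acc[i]; /* new slot */
        }
    }
    return out + 1;
}
```

Feeding in the five stack accesses from the `main` listing above (`rbp-0x14` with 4 bytes, and `rbp-0x20` and `rbp-0x8` with 8 bytes each, both accessed twice) gives three candidates, which correspond to `v1`, `v2`, and `v4`.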

## Type Constraining
Type constraining is directly linked to variable identification, since how a variable can be used depends on its size (for example, arrays versus primitive types).
Previous works in this area have looked at multiple ways to gather and reduce types for variables.
Some of these include: def-use analysis[^1][^5][^6], library awareness[^3][^5][^6], emulation[^4], lattice-solving[^7][^8], and machine learning[^9][^10].

Of these works, typing involving lattice-based methods[^7][^8], also known as push-down typing, is the most popular among modern open-source decompilers.
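
To give a feel for lattice-based typing, the sketch below defines a toy five-element lattice and its meet (greatest lower bound) operation in C. The lattice is an assumption made purely for illustration and is far smaller than those used in the cited systems; the names `Ty`, `TY_*`, and `meet` are hypothetical.

```c
/* Toy type lattice, ordered from least to most general:
 *
 *            TY_TOP        (nothing known)
 *              |
 *           TY_REG64       (some 64-bit value)
 *            /     \
 *      TY_INT64   TY_PTR
 *            \     /
 *           TY_BOTTOM      (conflicting constraints)
 */
typedef enum { TY_BOTTOM, TY_INT64, TY_PTR, TY_REG64, TY_TOP } Ty;

/* Greatest lower bound: the most general type that satisfies both a and b. */
Ty meet(Ty a, Ty b) {
    if (a == b) return a;
    if (a == TY_TOP) return b;
    if (b == TY_TOP) return a;
    if (a == TY_BOTTOM || b == TY_BOTTOM) return TY_BOTTOM;
    if (a == TY_REG64) return b;
    if (b == TY_REG64) return a;
    return TY_BOTTOM; /* TY_INT64 vs TY_PTR: incompatible */
}
```

Each new constraint on a variable is folded in with `meet`, so a variable that starts at `TY_TOP` narrows to `TY_REG64` after a 64-bit access and to `TY_PTR` once it is dereferenced. Reaching `TY_BOTTOM` signals that the collected constraints conflict.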


[^1]: Balakrishnan, Gogul, and Thomas Reps. "Divine: Discovering variables in executables." International Workshop on Verification, Model Checking, and Abstract Interpretation. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007.
[^2]: Mycroft, Alan. "Type-based decompilation (or program reconstruction via type reconstruction)." European Symposium on Programming. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999.
[^3]: Lin, Zhiqiang, Xiangyu Zhang, and Dongyan Xu. "Automatic reverse engineering of data structures from binary execution." Proceedings of the 11th Annual Information Security Symposium. 2010.
[^4]: Slowinska, Asia, Traian Stancescu, and Herbert Bos. "Howard: A Dynamic Excavator for Reverse Engineering Data Structures." NDSS. 2011.
[^5]: Haller, Istvan, Asia Slowinska, and Herbert Bos. "Mempick: High-level data structure detection in c/c++ binaries." 2013 20th Working Conference on Reverse Engineering (WCRE). IEEE, 2013.
[^6]: Jin, Wesley, et al. "Recovering C++ objects from binaries using inter-procedural data-flow analysis." Proceedings of ACM SIGPLAN on Program Protection and Reverse Engineering Workshop 2014. 2014.
[^7]: Noonan, Matt, Alexey Loginov, and David Cok. "Polymorphic type inference for machine code." Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2016.
[^8]: Lee, JongHyup, Thanassis Avgerinos, and David Brumley. "TIE: Principled reverse engineering of types in binary programs." NDSS. 2011.
[^9]: Zhang, Zhuo, et al. "Osprey: Recovery of variable and data structure via probabilistic analysis for stripped binary." 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021.
[^10]: Chen, Qibin, et al. "Augmenting decompiler output with learned variable names and types." 31st USENIX Security Symposium (USENIX Security 22). 2022.
1 change: 0 additions & 1 deletion docs/fundamentals/type_recovery/overview.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/fundamentals/type_recovery/pushdown.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/fundamentals/type_recovery/var_rec.md

This file was deleted.

4 changes: 4 additions & 0 deletions docs/static/img/typing.svg
5 changes: 1 addition & 4 deletions mkdocs.yml
@@ -41,10 +41,7 @@ nav:
- fundamentals/cfg_recovery/lifting.md
- fundamentals/cfg_recovery/func_recov.md
- fundamentals/cfg_recovery/jump_res.md
- Type Recovery:
- fundamentals/type_recovery/overview.md
- fundamentals/type_recovery/var_rec.md
- fundamentals/type_recovery/pushdown.md
- fundamentals/type_recovery.md
- Control Flow Structuring:
- fundamentals/cf_structuring/overview.md
- fundamentals/cf_structuring/schema-based.md