update typing section

mahaloz · Apr 28, 2024 · ece51b0 · ece51b0
1 parent 6f4b8c2
commit ece51b0

Showing 8 changed files with 126 additions and 19 deletions.
diff --git a/docs/fundamentals/cfg_recovery/lifting.md b/docs/fundamentals/cfg_recovery/lifting.md
@@ -1,14 +1,17 @@
 # Program Lifting 
 
+## Introduction
 In program lifting, disassembly is converted into an intermediate language (IL)[^1], which is also referred to as an intermediate representation (IR).
+Converting disassembly to an IL lets decompiler developers write optimizations at the IL level that apply across multiple architectures.
 
-There exist many ILs for analysis of programs, but some of the most notable for decompilation are tied to binary analysis.
+There exist many ILs for analysis of programs, but some of the most notable ones for decompilation are tied to binary analysis.
 Most binary analysis platforms that have created or used an IL often follow the same techniques but mostly differ in their later use of ILs[^1][^2][^3][^4]. 
-One such use is recompilation of decompilation, which can be made easier by lifting to compiled ILs like LLVM-IR[^5][^6]. 
+One such use is recompilable decompilation, which can be made easier by lifting to compiled ILs like LLVM-IR[^5][^6]. 
 
+As in static analysis more broadly, most ILs used in decompilation support some form of static single assignment (SSA), since it simplifies some analyses[^7].
 
 ## Example Lifted Program
-Below is some example x86 assembly:
+Below is some example x86 assembly of a simple C program:
 ```asm
 0000000000001129 <main>:
     1129:   f3 0f 1e fa             endbr64
@@ -22,10 +25,10 @@ Below is some example x86 assembly:
     1143:   c7 45 fc 01 00 00 00    mov    DWORD PTR [rbp-0x4],0x1
     114a:   8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
     114d:   5d                      pop    rbp
-    114e:   c3  
+    114e:   c3                      ret 
 ```
 
-As an example, it can be lifted to an IL like VEX, the IL used in the angr decompiler:
+It can be lifted to an IL like VEX, the IL used in the angr decompiler:
 ```asm
 00 | ------ IMark(0x401129, 4, 0) ------
 01 | PUT(rip) = 0x000000000040112d
@@ -67,12 +70,15 @@ NEXT: PUT(rip) = 0x000000000040113a; Ijk_Boring
 ...
 ```
 
-Notice the abstraction of compares and assignments makes the code much more verbose.
-As such, it has been cut for brevity. 
+In VEX, each `t` temporary is assigned exactly once, which puts the temporaries in SSA form.
+Notice also how verbose every instruction has become.
+Even a simple `mov`, which copies a value into a register or memory location, now carries much more explicit information.
+The lifted VEX above has been cut for brevity.
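
To make the single-assignment property concrete, here is a minimal sketch of SSA renaming for straight-line code. It is not how VEX or angr builds its temporaries; the function name `to_ssa` and the `(destination, expression)` statement format are assumptions made for illustration, and branches (which would need phi nodes) are deliberately left out.

```python
import re

def to_ssa(statements):
    """Rename variables in straight-line code so each name is assigned once."""
    version = {}  # variable name -> latest version number
    out = []
    for dest, expr in statements:
        # Uses are rewritten first: they refer to the latest existing version.
        def latest(match):
            name = match.group(0)
            return f"{name}_{version[name]}" if name in version else name
        new_expr = re.sub(r"\b[A-Za-z_]\w*\b", latest, expr)
        # The definition then gets a fresh version number.
        version[dest] = version.get(dest, -1) + 1
        out.append((f"{dest}_{version[dest]}", new_expr))
    return out

program = [("x", "1"), ("x", "x + 2"), ("y", "x * x")]
ssa = to_ssa(program)
for dest, expr in ssa:
    print(f"{dest} = {expr}")
```

After renaming, every left-hand side is unique, so finding the definition of any use is a direct lookup.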
 
 [^1]: Song, Dawn, et al. "BitBlaze: A new approach to computer security via binary analysis." Information Systems Security: 4th International Conference, ICISS 2008, Hyderabad, India, December 16-20, 2008. Proceedings 4. Springer Berlin Heidelberg, 2008.
 [^2]: Kinder, Johannes, and Helmut Veith. "Jakstab: a static analysis platform for binaries: tool paper." Computer Aided Verification: 20th International Conference, CAV 2008 Princeton, NJ, USA, July 7-14, 2008 Proceedings 20. Springer Berlin Heidelberg, 2008.
 [^3]: Brumley, David, et al. "BAP: A binary analysis platform." Computer Aided Verification: 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings 23. Springer Berlin Heidelberg, 2011.
 [^4]: Wang, Fish, and Yan Shoshitaishvili. "Angr-the next generation of binary analysis." 2017 IEEE Cybersecurity Development (SecDev). IEEE, 2017.
 [^5]: Gussoni, Andrea, et al. "A comb for decompiled c code." Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. 2020.
-[^6]: https://github.com/revng/revng
+[^6]: Revng. "Revng/Revng: Revng: The Core Repository of the Rev.Ng Project." GitHub, github.com/revng/revng. Accessed 27 Apr. 2024.
+[^7]: Van Emmerik, Michael James. Static single assignment for decompilation. University of Queensland, 2007.
diff --git a/docs/fundamentals/cfg_recovery/overview.md b/docs/fundamentals/cfg_recovery/overview.md
@@ -6,10 +6,10 @@ Since decompilation directly relies on this CFG, the recovery of it is considere
 
 This recovery can be broken up into multiple phases, some being optional depending on the decompilation target:
 
-1. [Disassembling](): the conversion of binary code to its mnemonic instructions and operands 
-2. [Program Lifting](): the conversion disassembly to an intermediate language (IL) for better abstraction 
-3. [Function Recognition](): the discovery of boundaries defining a function
-4. [Indirect Jump Resolving](): the resolution of jumps that have no constant target(s). 
+1. [Disassembling](/docs/fundamentals/cfg_recovery/disassembly): the conversion of binary code to its mnemonic instructions and operands 
+2. [Program Lifting](/docs/fundamentals/cfg_recovery/lifting): the conversion of disassembly to an intermediate language (IL) for better abstraction 
+3. [Function Recognition](/docs/fundamentals/cfg_recovery/func_recov): the discovery of boundaries defining a function
+4. [Indirect Jump Resolving](/docs/fundamentals/cfg_recovery/jump_res): the resolution of jumps that have no constant target(s). 
 
 The second phase, program lifting, is only required if the decompiler aims to be architecture agnostic.
 Most decompilers will use an IL to make their later analyses more widely applicable. 

diff --git a/docs/fundamentals/type_recovery.md b/docs/fundamentals/type_recovery.md
@@ -0,0 +1,103 @@
+# Type Recovery
+## Introduction
+Type recovery is the process of identifying high-level variables and their types from a binary[^1], usually in the form of a CFG.
+This wiki groups variable identification with type recovery since they are often intertwined in decompilation[^2]. 
+
+In most decompilers, the typing system requires a well-formed CFG, usually lifted, on which to perform analysis. 
+The analysis can be thought of as working in two refining stages:
+
+1. Variable Identification: discover initial locations and their boundaries
+2. Type Constraining: use how each variable is used in the code to build constraints on its size, then choose a type
+
+This process often iterates between variable identification and constraint building.
+Variable identification also includes retyping and resizing variables as more constraints are gathered.
+
+![](/static/img/typing.svg)
+
+## Typing Example
+A simple C program is shown below:
+```c
+int main(int argc, char** argv) {
+    char* str = argv[1];
+    puts(str);
+}
+```
+
+After compiling and disassembling:
+```bash
+gcc example.c -o example && objdump -D -M intel example | grep "<main>:" -A 12
+```
+
+We are left with the following:
+```asm
+0000000000001149 <main>:
+    1149:	f3 0f 1e fa          	endbr64
+    114d:	55                   	push   rbp
+    114e:	48 89 e5             	mov    rbp,rsp
+    1151:	48 83 ec 20          	sub    rsp,0x20
+    1155:	89 7d ec             	mov    DWORD PTR [rbp-0x14],edi
+    1158:	48 89 75 e0          	mov    QWORD PTR [rbp-0x20],rsi
+    115c:	48 8b 45 e0          	mov    rax,QWORD PTR [rbp-0x20]
+    1160:	48 8b 40 08          	mov    rax,QWORD PTR [rax+0x8]
+    1164:	48 89 45 f8          	mov    QWORD PTR [rbp-0x8],rax
+    1168:	48 8b 45 f8          	mov    rax,QWORD PTR [rbp-0x8]
+    116c:	48 89 c7             	mov    rdi,rax
+    116f:	e8 dc fe ff ff       	call   1050 <puts@plt>
+```
+
+A naive variable recovery algorithm might do the following:
+```c
+int main(int a1, long long a2) {
+    int v1; // rbp-0x14
+    long long v2; // rbp-0x20
+    long long v3; // rax
+    long long v4; // rbp-0x8
+    v1 = a1;
+    v2 = a2;
+    v3 = *(long long *)(v2 + 8); // load from [rax+0x8]
+    v4 = v3;
+    puts((char *) v4);
+}
+```
+
+However, `puts` is known to take a `char *` as its first argument, which allows `v4` to be constrained to a `char *`.
+Back-propagating this type constraint to the earlier variables gives the following:
+```c
+int main(int a1, char** a2) {
+    int v1; // rbp-0x14
+    char** v2; // rbp-0x20
+    char * v3; // rax
+    char * v4; // rbp-0x8
+    v1 = a1;
+    v2 = a2;
+    v3 = v2[1];
+    v4 = v3;
+    puts(v4);
+}
+```
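
The back-propagation step above can be sketched as a small fixed-point loop. This is only an illustration under simplifying assumptions, not the algorithm of any particular decompiler: the function `propagate_types` and its `copies`, `loads`, and `sinks` inputs are hypothetical, types are plain strings, and conflicts between constraints are ignored.

```python
def propagate_types(copies, loads, sinks):
    """Spread known types from type sinks to related variables.

    copies: (dst, src) pairs for plain assignments `dst = src`
    loads:  (dst, base) pairs for dereferences `dst = base[i]`
    sinks:  variable -> type string known from an explicit use
    """
    types = dict(sinks)
    changed = True
    while changed:
        changed = False
        for dst, src in copies:
            # A plain copy means both sides share a type.
            for known, unknown in ((dst, src), (src, dst)):
                if known in types and unknown not in types:
                    types[unknown] = types[known]
                    changed = True
        for dst, base in loads:
            # dst = base[i]: base must point to whatever dst is.
            if dst in types and base not in types:
                types[base] = types[dst] + "*"
                changed = True
    return types

# Statements from the naive recovery of main, with puts(v4) as the sink.
copies = [("v1", "a1"), ("v2", "a2"), ("v4", "v3")]
loads = [("v3", "v2")]
types = propagate_types(copies, loads, sinks={"v4": "char *"})
for name in sorted(types):
    print(name, ":", types[name])
```

Variables that never reach a sink, such as `v1` and `a1` here, get no type from this pass and would keep their size-based default.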
+
+## Variable Identification
+Variable identification seeks to map memory locations and registers to high-level variables in the targeted language output (usually C).
+Early decompilers often mapped variables to locations based on simple, direct accesses[^2].
+Later work built on this by expanding the set of access patterns that can be used to identify a variable's location[^1].
+Identified variables often have many candidate sizes, since their eventual type is not yet known.
+These candidates are typically reduced to a single type using type sinks[^3]: uses with an explicit type, such as an argument to a call with a known signature.
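
As a rough illustration of access-based identification, the sketch below groups the `rbp`-relative accesses from the disassembly of `main` by offset and records the sizes seen at each one. The function name `identify_stack_variables` and the `(offset, size)` input format are assumptions for this example; a real analysis would also track registers, aliasing, and overlapping accesses.

```python
from collections import defaultdict

def identify_stack_variables(accesses):
    """Group stack accesses by offset and collect candidate sizes in bytes."""
    sizes = defaultdict(set)
    for offset, size in accesses:
        sizes[offset].add(size)
    # One candidate variable per distinct offset, lowest address first.
    return {f"rbp-{-offset:#x}": sorted(s) for offset, s in sorted(sizes.items())}

# (offset from rbp, access size) taken from the mov instructions in main.
accesses = [
    (-0x14, 4),  # mov DWORD PTR [rbp-0x14],edi
    (-0x20, 8),  # mov QWORD PTR [rbp-0x20],rsi
    (-0x20, 8),  # mov rax,QWORD PTR [rbp-0x20]
    (-0x8, 8),   # mov QWORD PTR [rbp-0x8],rax
    (-0x8, 8),   # mov rax,QWORD PTR [rbp-0x8]
]
variables = identify_stack_variables(accesses)
for location, candidate_sizes in variables.items():
    print(location, candidate_sizes)
```

Here every location has a single candidate size; a location accessed with several widths would keep all of them until a type sink narrows the choice.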
+
+## Type Constraining
+Type constraining is directly linked to variable identification since the use of the variable is affected by its size (arrays vs primitive types).
+Previous works in this area have looked at multiple ways to gather and reduce types for variables.
+These include def-use analysis[^1][^5][^6], library awareness[^3][^5][^6], emulation[^4], lattice-solving[^7][^8], and machine learning[^9][^10].
+
+Of these works, typing involving lattice-based methods[^7][^8], also known as push-down typing, is the most popular among modern open-source decompilers. 
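
To give a feel for what a lattice buys you, here is a toy example of joining two candidate types. The lattice below is invented for illustration and is far simpler than those used in the cited works; the names `PARENT`, `ancestors`, and `join` are assumptions, and the lattice is restricted to a tree so the join is just the nearest common ancestor.

```python
# Toy type hierarchy: each type maps to its more general parent.
PARENT = {
    "int32": "num",
    "int64": "num",
    "char*": "ptr",
    "num": "top",
    "ptr": "top",
    "top": None,
}

def ancestors(t):
    """Return t followed by every more general type up to 'top'."""
    chain = []
    while t is not None:
        chain.append(t)
        t = PARENT[t]
    return chain

def join(a, b):
    """Least general type that covers both a and b."""
    seen = set(ancestors(a))
    for t in ancestors(b):
        if t in seen:
            return t
    # Unreachable here because every chain ends at 'top'.
    raise ValueError(f"no common type for {a} and {b}")

print(join("int32", "int64"))
print(join("int32", "char*"))
```

When two uses of a variable disagree, the join gives the most specific type consistent with both; a result of `top` signals that the uses share nothing more specific.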
+
+
+[^1]: Balakrishnan, Gogul, and Thomas Reps. "Divine: Discovering variables in executables." International Workshop on Verification, Model Checking, and Abstract Interpretation. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007.
+[^2]: Mycroft, Alan. "Type-based decompilation (or program reconstruction via type reconstruction)." European Symposium on Programming. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999.
+[^3]: Lin, Zhiqiang, Xiangyu Zhang, and Dongyan Xu. "Automatic reverse engineering of data structures from binary execution." Proceedings of the 11th Annual Information Security Symposium. 2010.
+[^4]: Slowinska, Asia, Traian Stancescu, and Herbert Bos. "Howard: A Dynamic Excavator for Reverse Engineering Data Structures." NDSS. 2011.
+[^5]: Haller, Istvan, Asia Slowinska, and Herbert Bos. "Mempick: High-level data structure detection in c/c++ binaries." 2013 20th Working Conference on Reverse Engineering (WCRE). IEEE, 2013.
+[^6]: Jin, Wesley, et al. "Recovering C++ objects from binaries using inter-procedural data-flow analysis." Proceedings of ACM SIGPLAN on Program Protection and Reverse Engineering Workshop 2014. 2014.
+[^7]: Noonan, Matt, Alexey Loginov, and David Cok. "Polymorphic type inference for machine code." Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2016.
+[^8]: Lee, JongHyup, Thanassis Avgerinos, and David Brumley. "TIE: Principled reverse engineering of types in binary programs." NDSS. 2011.
+[^9]: Zhang, Zhuo, et al. "Osprey: Recovery of variable and data structure via probabilistic analysis for stripped binary." 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021.
+[^10]: Chen, Qibin, et al. "Augmenting decompiler output with learned variable names and types." 31st USENIX Security Symposium (USENIX Security 22). 2022.
diff --git a/docs/fundamentals/type_recovery/overview.md b/docs/fundamentals/type_recovery/overview.md
diff --git a/docs/fundamentals/type_recovery/pushdown.md b/docs/fundamentals/type_recovery/pushdown.md
diff --git a/docs/fundamentals/type_recovery/var_rec.md b/docs/fundamentals/type_recovery/var_rec.md
diff --git a/docs/static/img/typing.svg b/docs/static/img/typing.svg
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -41,10 +41,7 @@ nav:
     - fundamentals/cfg_recovery/lifting.md
     - fundamentals/cfg_recovery/func_recov.md
     - fundamentals/cfg_recovery/jump_res.md
-  - Type Recovery:
-    - fundamentals/type_recovery/overview.md
-    - fundamentals/type_recovery/var_rec.md
-    - fundamentals/type_recovery/pushdown.md
+  - fundamentals/type_recovery.md
   - Control Flow Structuring:
     - fundamentals/cf_structuring/overview.md
     - fundamentals/cf_structuring/schema-based.md