updates tdag documentation #6573

Merged
merged 3 commits into from
Nov 20, 2024
Binary file added docs/img/tdag.png
Binary file added docs/img/tdagtree.png
179 changes: 73 additions & 106 deletions docs/tdag.md
@@ -1,156 +1,123 @@
# TaintDAG file format
# The Tainted Directed Acyclic Graph (TDAG) File Format

The Taint Directed Acyclic Graph (TaintDAG, TDAG) file format is tailored to facilitate fast recording of taint operations.
Please see [our ISSTA paper](https://github.com/trailofbits/publications/blob/master/papers/issta24-polytracker.pdf) for a more formal version (as of July 2024) of the documentation that follows.

It is a binary file format based on sparse files. It consists of a header and a number of subsections. The subsections store information about:
PolyTracker uses static instrumentation placed with several [LLVM passes](../polytracker/include/polytracker/passes) to produce several different but complementary kinds of runtime program information in the TDAG format.

- taint sources: filename and offset information
- taint output log: tainted values written to an output file
- taint graph: the graph of how taint values are unioned from source taint and other unions.
The Tainted Directed Acyclic Graph (TaintDAG, TDAG) binary file format is an abstract representation of the directed acyclic graph of each taint label recorded at execution time and the provenance relationships between them. Each provenance relationship between labels is directed (each label knows about its two most immediate ancestors, if it's not a source label). The acyclic nature of this graph is a result of processing input bytes and tracing those processing operations during parsing. Here's an idea of what we're talking about:

Because of the sparse layout it is very well suited for memory mapping (via `mmap()`) directly into the instrumented process address space.
<p align="center">
<img src="img/tdagtree.png" alt="The Directed Acyclic Graph part of a TDAG. This is incidentally a figure from our ISSTA paper. Check out that paper if you want more diagrams." height="400"/>
</p>

## Taint Sources, Unions and Ranges
_Figure: an idealized Tainted Directed Acyclic Graph. Increase in color saturation indicates the accumulation of data flow taint._

Whenever data is read from an input file, the data entering the program is labeled as source taint. Information about which file and at what offset is kept. This is the only way taints can originate in a program.
We store PolyTracker traces as TDAG files so they can be analyzed after they are recorded. Whenever a PolyTracker-instrumented binary runs, that binary produces a new TDAG. Unlike other information flow tracking tools, PolyTracker does not currently do any "online" analysis at runtime! We store PolyTracker traces in the TDAG format so that we can "post-hoc" or "offline" conduct sampled analyses, differential (comparative) analyses between traces, and other types of analyses that are not possible at runtime. Conveniently enough, this also means we don't need to spend runtime tracing memory on analysis operations and can separately optimize our analyses.

As the instrumented binary operates on the now labeled data, the associated taint labels need to reflect those operations. E.g. on addition of two tainted values there should be a new taint label associated with the result. The new label should reflect the union of the operand labels.
The code that reads TDAGs and produces Python data structures from their sections lives primarily in [taint_dag.py](../polytracker/taint_dag.py). This "read side" code corresponds directly to the C++ header-file definitions we use to write a TDAG file as we build up data flow information, which can be found in [polytracker/include/taintdag](../polytracker/include/taintdag).

```C
uint32_t a = ...;
uint32_t b = ...;
uint32_t result = a + b;
```
The size of a given TDAG depends not only on the size of the traced input, but also on the complexity of the operations the instrumented software performed on the input bytes. Some TDAGs are quite large, to put it mildly. To reduce the number of file system operations needed, the `polytracker` Python library [memory-maps the contents of the TDAG into process memory](https://github.com/trailofbits/polytracker/blob/master/polytracker/taint_dag.py#L451) (via `mmap()`). This makes it possible to work with "raw" TDAGs exactly like one might work with an array or a buffer. We currently use a number of buffer copies for operations on the TDAG, but this strategy will eventually prove unsustainable for tracing more complex algorithms (such as zlib decompression) run on larger inputs.

For the above case the taint label of `result` represents a union of the taint labels of `a` and `b`.
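
As a rough illustration of why memory mapping lets a file be treated like a buffer, the sketch below maps a small stand-in file with Python's standard `mmap` module and decodes fixed-width values directly from the mapping. The helper names `read_u64_entries` and `demo`, the temporary file, and the choice of 8-byte entries are assumptions made for this example; this is not the code in `taint_dag.py`.

```python
import mmap
import os
import struct
import tempfile


def read_u64_entries(path, count):
    """Memory-map `path` read-only and decode `count` native-endian uint64 values."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # The mapping exposes the buffer protocol, so struct can read from it
            # at any offset without an explicit read() call per entry.
            return [struct.unpack_from("=Q", mapped, i * 8)[0] for i in range(count)]


def demo():
    # Write three hypothetical 8-byte entries to a temporary stand-in file.
    values = [1, 2, 0xDEADBEEF]
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(struct.pack("=3Q", *values))
        path = tmp.name
    try:
        return read_u64_entries(path, len(values))
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Indexing by `i * 8` is the same arithmetic one would use on an in-memory array of 64-bit entries, which is the convenience memory mapping buys.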
## Layout

If the taint labels considered for a union are adjacent (number wise), e.g. two consecutive source taint bytes, a range is created. Unions and ranges occupy the same amount of storage. The main difference is that a range can be extended to become a larger range.
Each TDAG includes a number of subsections; the largest of these is typically the labels section, followed by the control flow log section. The following image is an idealized representation of what you'll find in an average TDAG (as of July 2024). Note that this drawing does not show every section; the bitmap section, for example, is not drawn.

Consider the following operation on source bytes
<p align="center">
<img src="img/tdag.png" alt="The TDAG. This is incidentally a figure from our ISSTA paper. Check out that paper if you want more diagrams." height="400"/>
</p>

```C
uint8_t src[1024];
// read source taint
uint32_t val = *(uint32_t*)src;
```
_Figure: Layout of an idealized TDAG. Increase in color saturation indicates the accumulation of data flow taint._

In this example `val` should be labeled with the union of the four consecutive source taint labels. In this case a range is instead created representing all four labels.
Every [section](../polytracker/include/taintdag/section.h) in the TDAG has a predefined size, entry size, and optionally also spacing/padding between entries. The sections available in a TDAG file are accessed by tag by the class `TDFile` in [taint_dag.py](../polytracker/taint_dag.py).

The main motivation for introducing ranges is to allow for efficient membership testing. If a taint label is already included in a range of taint values, the range can be reused. It is possible to unfold the range into a tree of unions and walk the tree, but it requires more computation.
Some specifics:

```C
uint8_t src[1024];
// read source taint
uint32_t val1 = *(uint32_t*)src;
uint32_t val2 = val1 + src[1];
```
- [File Header](../polytracker/include/taintdag/outputfile.h): this header consists of the TDAG magic bytes, and then "meta" information used to determine the number, type, and contents of the sections that follow FileHeader. This is what `TDFile` is going to interpret to figure out what to do with the rest of the file contents.
- [Labels](../polytracker/src/taintdag/) consists of the tainted information flow labels recorded at runtime
- [Sources](../polytracker/src/taint_sources/taint_sources.cpp) contains source labels (byte offsets into the input)
- The Source Label Index is a bitmap that defines how to index the sources section.
- [Sinks](../polytracker/include/taintdag/sink.h) contains sink labels (representing bytes of the output)
- [Strings](../polytracker/include/taintdag/string_table.h) todo(kaoudis) the string table is used in conjunction with the fnmapping to put together an earlier version of the control flow log used for grammar extraction
- [Functions](../polytracker/include/taintdag/fnmapping.h) todo(kaoudis) this contains an early version of the function list part of the control flow log used for grammar extraction
- [Events](../polytracker/include/taintdag/fntrace.h) todo(kaoudis) this contains an early version of the entry and exit events used to structure the control flow log
- [Control Flow Log](../polytracker/include/taintdag/control_flow_log.h): this consists of the function entry and exit records we need to reconstruct the call stack that data flow passed through.

In this slightly extended example the label of `val2` can be made equal to `val1`. It depends on the exact same source labels. Ranges make checking for such cases more efficient.
## TDAG Contents

## Affects control flow
You'll notice the TDAG doesn't just include data flow labels; it also holds the other information we collect. We use LLVM passes to place several different kinds of static instrumentation at build time. Via this instrumentation, the PolyTracker library collects different, complementary aspects of runtime information flow.

In addition to being Source-, Union- or Range-Taint, each value is also marked if it affects control flow. The basic example is a value with taint label `L` is read from file, compared against another value, and a branch is taken based on the result. Whenever the conditional branch is executed, the taint with label `L` is marked as affecting control flow.
We track a couple different kinds of information flow, and record them all together in different sections of the same file:

<!-- TODO(msurovic): This paragraph is a bit clunky, but I don't know how to rephrase it. -->
- dynamic information flow trace labels (taint labels)
- "affects-control-flow" label tags (represented with the letter C in the above figure)
- function entries and exits that correspond to the function log for callstack reconstruction purposes

Affects control flow propagates through unions and ranges. This means that if a value with a union or range label `W` affects control flow, then each taint label represented by `W` is in turn marked as affecting control flow.
We also include the following more static data:

## File format
- the function log that corresponds to the section of the control flow graph our data flow trace followed
- the index of source bytes, and a mapping between source bytes and initial labels
- the index of sink bytes

The general layout of the file is as follows:
Once the TDAG of interest has been read in, we can re-construct information that is useful for analyses, such as:

```
[FileHdr][FileDescriptorMap][TDAGMapping][SinkLog]
```
- provenance relationships for each intermediate label (these tell you how we got a particular label and what data it descends from; these relationships also can be leveraged to determine what other labels descend from the label of interest), and
- the control flow log to label mapping for each intermediate label (this tells you where/when we recorded the label during execution).
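
To make the idea of re-constructing provenance relationships concrete, here is a minimal sketch that walks a toy label graph. It assumes only what is stated above: every non-source label knows its two most immediate ancestors. The `parents` dictionary, the label numbers, and the function names are hypothetical and do not mirror the data structures in `taint_dag.py`.

```python
# Hypothetical toy graph: label -> (left, right) ancestors, or None for a source label.
EXAMPLE_PARENTS = {
    1: None,
    2: None,
    3: None,
    4: (1, 2),  # derived from labels 1 and 2
    5: (4, 3),  # derived from labels 4 and 3
    6: (2, 3),  # derived from labels 2 and 3
}


def ancestors(label, parents):
    """Return every label reachable by following ancestor links from `label`."""
    seen = set()
    stack = list(parents[label] or ())
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents[current] or ())
    return seen


def source_ancestors(label, parents):
    """Return the source labels that `label` descends from (itself, if it is a source)."""
    candidates = ancestors(label, parents) | {label}
    return {l for l in candidates if parents[l] is None}


def descendants(label, parents):
    """Return every label that has `label` somewhere among its ancestors."""
    return {other for other in parents if label in ancestors(other, parents)}


if __name__ == "__main__":
    print(source_ancestors(5, EXAMPLE_PARENTS))
    print(descendants(2, EXAMPLE_PARENTS))
```

Walking ancestor links answers "what data does this label descend from", while the reverse query over the whole graph answers "what other labels descend from this one".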

### FileHdr
## Labels

```C
struct FileHdr {
uint64_t fd_mapping_offset;
uint64_t fd_mapping_count;
uint64_t tdag_mapping_offset;
uint64_t tdag_mapping_size;
uint64_t sink_mapping_offset;
uint64_t sink_mapping_size;
};
```
Each label is currently of size `uint32_t`, but because we store other data that describes the label in a bit vector right alongside it, we use a `uint64_t` ("`storage_t`") to hold all of this. See `label_t` and `storage_t` in [taint.h](../polytracker/include/taintdag/taint.h). See [encoding.h](../polytracker/include/taintdag/encoding.h) and the comments in [encoding.cpp](../polytracker/src/taintdag/encoding.cpp) for some description of what actually goes in a "taint label".

Each offset is relative to the start of the file.
As the instrumented binary operates on the now labeled data, the associated taint labels need to reflect those operations. For example, if the instrumented program happens to add two tainted values, we will create a new taint label to represent the addition's result. The new label should reflect the union of the operand labels.

### FDMappingHdr
### Source Labels

At `fd_mapping_offset` there is an array of `FDMappingHdr` structures. Length of the array is given by `fd_mapping_count`.
Whenever the instrumented program reads in data, by default, we label each byte of that input as a _taint source_. PolyTracker can also use stdin or all of argv as sources of taint, but you'll need to set `POLYTRACKER_STDIN_SOURCE` or `POLYTRACKER_TAINT_ARGV`, respectively, to make that work. When we work with sources on the "read" or analysis side of PolyTracker, each `source` has a reference to the input file it came from.

```C
struct FDMappingHdr {
int32_t fd;
uint32_t name_offset;
uint32_t name_len;
uint32_t prealloc_begin;
uint32_t prealloc_end;
};
```
### Shadow Memory Usage

Each of the `FDMappingHdr` structures has an implicit index. Subsequent structures in the TDAG use that index to refer to each `FDMappingHdr`.
As a note for the unwary, how we use shadow memory differs a bit from how you might expect DFSan to operate. We primarily cache intermediate labels there: a given label needs to be written out to the TDAG file before we can add to shadow memory a new descendant label resulting from a range or union operation involving that initial label.

```C
uint32_t a = ...;
uint32_t b = ...;
uint32_t result = a + b;
```
[FDMappingHdr][FDMappingHdr]...[FDMappingHdr]
Index 0 Index 1 ... Index N
```

The `fd` field is the file descriptor as seen at runtime. The `name_offset` is the offset at which the name associated with `fd` is located in the TDAG file. The `name_len` is the length of the file name at `name_offset`. The `prealloc_begin` and `prealloc_end`, if not zero, indicate a source taint sequence of adjacent labels that was preallocated for this file. The idea is to have as many contiguous labels as possible for the same file, aiming at maximising the number of ranges generated.

### SourceTaint, UnionTaint and RangeTaint - the actual TDAG
For the above case the taint label of `result` represents a union of the taint labels of `a` and `b`. The labels of `a` and `b` will be written out to the labels section of the TDAG before we add the label corresponding to `result` to shadow memory.

At `tdag_mapping_offset` there is `tdag_mapping_count` of `uint64_t` entries. Each entry denotes either a Source-, Union- or Range-Taint. Their relative index is the taint label. Index zero is unused as it denotes 'not tainted'. The general layout of the `uint64_t` value is:

```
| x y zzz...z |
63 0
```
### Range Versus Union

Bits `x` and `y` are common for the three kinds of taint values. The value `x` is set to one to indicate that it is a source taint and to zero if it is a Union- or Range-Taint. The value `y` is set to one if the taint affects control flow and zero if not.
If the taint labels considered for a union are adjacent in memory, for example, two consecutive source taint bytes, we create a _range_ label instead. Unions and ranges occupy the same amount of storage. The main difference between these two label creation operations and the resulting label types is that a range can be "extended" to become a larger range.

For SourceTaint, the following layout is used:
Consider the following operation on source bytes

```
| x y ooo oo iiiiiiii |
63 61 7 0
```C
uint8_t src[1024];
// read source taint
uint32_t val = *(uint32_t*)src;
```

Here, `o` denotes the offset in the source file. The `i` denotes the source file index, referring to the `FDMappingHdr` index and structures previously described.
In this example `val` should be labeled with the union of the four consecutive source taint labels. In this case a range is instead created representing all four labels.

If `x` is zero, the value is either a Union-Taint or a Range-Taint. They share a common layout
The main motivation for introducing ranges is to allow for efficient membership testing. If a taint label is already included in a range of taint values, the range can be reused. It is possible to unfold the range into a tree of unions and walk the tree, but that requires more computation.

```
| x y vvv ... v www ... w |
63 61 30 0
```C
uint8_t src[1024];
// read source taint
uint32_t val1 = *(uint32_t*)src;
uint32_t val2 = val1 + src[1];
```

Here, `v` and `w` denote unsigned integers referring to other taint values in the TDAG structure. To differentiate between a range and a union the following rule is used:

```
v < w => RangeTaint
w < v => UnionTaint
w == v => undefined
```
In this slightly extended example the label of `val2` can be made equal to `val1`. It depends on the exact same source labels. Ranges make checking for such cases more efficient.
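
The difference in cost between the two membership checks can be sketched as follows. This toy model assumes a range stores the first and last of a run of consecutive source labels (inclusive) and a union stores its two operand labels; the `nodes` table, the tuple encoding, and the function names are hypothetical and are not PolyTracker's on-disk encoding.

```python
# Hypothetical label table. Labels 1-4 stand for the four source bytes src[0..3].
NODES = {
    1: ("source",),
    2: ("source",),
    3: ("source",),
    4: ("source",),
    5: ("range", 1, 4),  # val1: a range covering source labels 1..4
    6: ("union", 1, 2),  # the same four labels, unfolded into a tree of unions
    7: ("union", 6, 3),
    8: ("union", 7, 4),
}


def contains(label, candidate, nodes):
    """Return True if source label `candidate` is represented by `label`."""
    kind, *operands = nodes[label]
    if kind == "source":
        return label == candidate
    if kind == "range":
        first, last = operands
        return first <= candidate <= last  # a single comparison, no tree walk
    left, right = operands
    # For a union, both operand subtrees may have to be searched.
    return contains(left, candidate, nodes) or contains(right, candidate, nodes)


def label_for_sum(a, b, nodes):
    """Reuse label `a` when it already represents source label `b`; otherwise signal
    that a new label would be needed by returning None."""
    return a if contains(a, b, nodes) else None


if __name__ == "__main__":
    # val2 = val1 + src[1]: label 2 is already inside range label 5.
    print(label_for_sum(5, 2, NODES))
```

Both label 5 and label 8 represent the same four source labels, but only the range answers the membership question without recursion.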

### SinkLog
## `affects_control_flow`

The sinklog is a sequence of records logging what tainted values have been written to output files. Each entry in the sinklog index is defined as
When we find that the value tagged with a particular taint label affects control flow, we set a bit on the label indicating such on the write side (see [taint.h](../polytracker/include/taintdag/taint.h) for the definition and [labels.h](../polytracker/include/taintdag/labels.h) for how we do this). On the read side, we represent this property for labels using the Boolean value `affects_control_flow` (see [taint_dag.py](../polytracker/taint_dag.py)).

```C
struct SinkLogEntry {
uint8_t fdidx;
uint64_t offset;
uint32_t label;
};
```
If a parent of a given label has `affects_control_flow` set, we also set `affects_control_flow` for the label. This means that `affects_control_flow` is a type of tainted information flow that we track separately from our main data flow. Note also that if a value with a union or range label `W` affects control flow, then each taint label represented by `W` is in turn marked as affecting control flow.

NOTE: The structure is assumed to be packed and occupy `1 + 8 + 4 = 13` bytes.
In this structure, the `fdidx` is an index into the `FDMappingHdr` array described previously. The `offset` is the offset in the output file represented by `fdidx`. Finally the `label` is the taint label associated with the data written and is thus an index into the TDAG structure at `tdag_mapping_offset`.
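
The packed 13-byte `SinkLogEntry` layout described here can be decoded with Python's standard `struct` module, as in the sketch below. The format string `=BQI` asks for native byte order with no alignment padding, which matches the packing note and the native-endianness caveat in this document. The function name `parse_sink_log` and the sample bytes are made up for illustration.

```python
import struct

# fdidx (uint8), offset (uint64), label (uint32), packed with no padding.
SINK_LOG_ENTRY = struct.Struct("=BQI")


def parse_sink_log(buf):
    """Decode a buffer of back-to-back SinkLogEntry records into dictionaries.

    Raises struct.error if the buffer length is not a multiple of 13 bytes.
    """
    return [
        dict(zip(("fdidx", "offset", "label"), fields))
        for fields in SINK_LOG_ENTRY.iter_unpack(buf)
    ]


if __name__ == "__main__":
    # Two made-up records: file index 0 at offsets 0 and 1, with labels 7 and 9.
    sample = SINK_LOG_ENTRY.pack(0, 0, 7) + SINK_LOG_ENTRY.pack(0, 1, 9)
    print(SINK_LOG_ENTRY.size)
    print(parse_sink_log(sample))
```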
A simple example of a control-flow-affecting data flow operation is: a value with taint label `L` is read from file, compared against another value, and a branch is taken based on the result. Whenever the conditional branch is executed, the taint with label `L` is marked as affecting control flow.
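
The rule that marking a union or range label also marks every label it represents can be sketched as a walk over ancestor links. This is an illustration under assumptions, not PolyTracker's write-side logic: the `parents` table, the label numbers, and the use of a Python set in place of a per-label bit are all hypothetical.

```python
# Hypothetical toy graph: label -> (left, right) ancestors, or None for a source label.
EXAMPLE_PARENTS = {
    1: None,
    2: None,
    3: None,
    4: (1, 2),
    5: (4, 3),
}


def mark_affects_control_flow(label, parents, marked):
    """Add `label` and every label it represents to the `marked` set."""
    stack = [label]
    while stack:
        current = stack.pop()
        if current in marked:
            continue  # already visited during this or an earlier walk
        marked.add(current)
        stack.extend(parents[current] or ())
    return marked


if __name__ == "__main__":
    # A branch depends on the value labeled 4, so labels 4, 1, and 2 get marked.
    print(sorted(mark_affects_control_flow(4, EXAMPLE_PARENTS, set())))
```

Label 3 and label 5 stay unmarked in this run, since neither is represented by label 4.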

## Portability

The file format is currently not portable. There is no effort made to store values in anything other than the native endianness.
We store all values in their native endianness. This file format is currently not portable.