Deparsing Paper
This was the original draft for https://github.com/rocky/python-uncompyle6/wiki/Deparsing-Paper.pdf
Decompiling is often used in conjunction with recovering lost source-code, or reverse engineering code when you don't have access to the source code.
Here I describe another use: in tracebacks, debuggers, core dumps and similar places where accurate position-reporting is desired. I show how more conventional methods of source-code reporting are vague and ambiguous.
Decompilation is the most accurate way to describe where the program currently is, in high-level terms familiar to the programmer. It is not in widespread use because it is unfamiliar and labor-intensive. This paper shows how to ameliorate these problems by giving a pattern for writing decompilers. I hope to encourage the incorporation of decompilation into tracebacks, debuggers, core dumps, and so on.
I also show how decompilation advances the art of debugging optimized code.
Although effecting a pervasive change in a compiler is arduous and error-prone, the decompiler can be developed somewhat independently.
Next, I describe a pipeline for a Python decompiler, and show similarities with a conventional programming-language compiler.
Finally, I describe issues around developing and maintaining a Python bytecode decompiler, which handles input spanning 15 releases of the language over some 20 years.
One of my pet peeves is the vagueness of reporting a line number as a position in a program. Consider the following Python code:
prev = [100] + range(3)
def boom(p):
    x = prev[prev[p]]
    ...
If we run this in Python, we get:
IndexError: list index out of range
But which index caused the problem?
Or how about errors in these lines:
[x[0] for i in d[j] if got[i] == e[i]] # which index?
x = a / b / c # which divide?
return fib(x) + fib(y) # which fib?
And then there are the problems with functions like eval() which evaluate either a string containing a programming-language fragment, or some intermediate representation of a program such as an Abstract Syntax Tree or a code object. In these cases, the program is created at run time, either from strings or from method calls which create AST objects. Either way, there is no file containing source code as a stream of characters.
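As a concrete illustration, here is a minimal sketch using Python's built-in compile() and eval(); the filename "<generated>" is something we invent, and there is no file of source text for a traceback to point into.

# Build a program at run time; no file on disk contains this source.
source = "prev[prev[0]]"
code = compile(source, "<generated>", "eval")
prev = [2, 0, 1]
print(eval(code))   # prints 1; any error here could only point at "<generated>"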
The Pos type in Go's AST is more informative in that it encodes the file name, line number, and column number all very compactly as a single integer. Even better, though, would be to extend this single position to a pair of Pos values marking an interval. Go's implementation cleverly takes advantage of the property that the number of interesting locations in a program, whether a single point or a pair of points, is much smaller than the program's total number of bytes.
Years of experience have shown that trying to get compilers and interpreters to track location correspondences through the compilation process and then save them in the run-time system is an uphill battle. When I suggested using interval positions in Go, the reaction was that the line/column approach was already slowing down compilation.
And adding a column position to a compiler that lacks it is, in truth, a big disruptive change and likely to introduce errors. The Clang compiler of the LLVM project, an alternative to the GNU Compiler Collection (GCC), did this. However it took years before a corresponding change was added to GCC as well. In both cases, column position is tracked only up to the parsing phase and not at run time, where it is also useful.
Even with line and column number, you have problems. Consider:
fn().a
^
Is the problem in accessing fn before the call, or accessing the result after?
In my experience with Go, this feature sparked a discussion of how to disambiguate the two and where the canonical location should be.
Distinguishing before from after a call could be done by putting the column indicator at the beginning of the identifier, at the "f". If the column position were instead the position of the last letter, the "n", it would indicate after the call. But the position is ambiguous if the identifier is a single letter. If we shift the column one place before or after the identifier, there might be ambiguity as to which token is indicated, because another token like + might be there instead of a blank space. Can we use the opening and closing parentheses instead? That works if the language requires parentheses (so it will not work in Ruby, where parentheses are optional). But then we consider:
fn.a
and have the same issue: where to put the column position for attribute accessors? And is our somewhat arbitrary decision consistent with our earlier decision? And so on.
Even with all this hand wringing, the place ultimately chosen might cause confusion. For example in the standard CPython implementation, the line number reported in a multi-line string happens to be the line number at the end of the string, rather than at the beginning. I thought this confusing the first time I encountered it. In the PyPy Python implementation it is at the beginning.
Basically the problem is that a line number and a column number have too little information to fully describe a position in a program.
A much cleaner and more accurate solution, and one requiring less deliberation, is to supply an interval. Here the examples above become:
fn().a
--
to indicate before the call (a reference to just fn), and:
fn().a
----
for after the call. See below for how to disambiguate in a language like Ruby where parentheses are optional.
And in the second situation involving object accessors:
fn.a
--
vs.
fn.a
  --
There is much less ambiguity, and the little that remains is covered by supplying a grammar nonterminal name or instruction information at the stopping point.
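One simple way to represent such a position is as an interval together with the grammar construct it belongs to. The Fragment class below is a hypothetical illustration of that idea, not the decompiler's actual data structure.

from dataclasses import dataclass

@dataclass
class Fragment:
    start: int        # offset of the first character of the fragment
    end: int          # offset just past the last character
    nonterminal: str  # e.g. "call_function" or "binary_expr"

    def underline(self, line):
        """Render the source line with the fragment marked by dashes."""
        return line + "\n" + " " * self.start + "-" * (self.end - self.start)

# Marks "fn()" -- the interval for after the call -- in the example above.
print(Fragment(0, 4, "call_function").underline("fn().a"))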
For the examples given at the beginning of this section, the uncompyle6 decompiler gives:
x = prev[prev[0]]
        --------
x = prev[prev[0]]
    -------------
[x[0] for i in d[j] if got[i] == e[i]]
 ----
[x[0] for i in d[j] if got[i] == e[i]]
               ----
[x[0] for i in d[j] if got[i] == e[i]]
                       ------
[x[0] for i in d[j] if got[i] == e[i]]
                                 ----
x = a / b / c
      -
x = a / b / c
          -
In sum, the solution given here eliminates the classes of problems described above:
- ambiguity of location,
- source does not exist as text, and
- invasive changes to compiler and run-time environment
As to the last point, all of this is done without modifying the compilation process. We do have to hook into the run-time environment, but only at the point of an exception which is usually easy. Often the language has in place a callback-hook mechanism for handling run-time exceptions.
Since this is done as a separate isolated section of code or perhaps an external module, we can give a very precise location without slowing down compilation or bloating the run-time system. Even more, the position is not just a range of text in the source code, but it also can be associated with grammar constructs.
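For example, in Python the run-time hook can be as small as the sketch below. sys.excepthook is a standard hook; deparse_around_offset() is a hypothetical stand-in for a decompiler routine that maps a code object and bytecode offset to a source fragment.

import sys

def deparsing_excepthook(exc_type, exc_value, tb):
    # Walk to the innermost frame, where the error actually occurred.
    while tb.tb_next is not None:
        tb = tb.tb_next
    code = tb.tb_frame.f_code
    offset = tb.tb_lasti  # bytecode offset of the failing instruction
    # Hypothetical: ask the decompiler for the text fragment at this offset.
    # fragment = deparse_around_offset(code, offset)
    print("error at bytecode offset %d in %s" % (offset, code.co_name))
    sys.__excepthook__(exc_type, exc_value, tb)

sys.excepthook = deparsing_excepthook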
I close this section with another example, using my Python 3 debugger which uses uncompyle6's decompilation routines. This is the interaction of the debugger in debugging a simple-minded implementation of the Fibonacci function:
(/tmp/fib.py:1): fib
-> 1 def fib(a):
(trepan3k) up
(/tmp/fib.py:3 @30): fib
(trepan3k) backtrace
##0 fib(a=0) called from file '/tmp/fib.py' at line 1
->1 fib(a=2) called from file '/tmp/fib.py' at line 3
##2 <module> exec() '/tmp/fib.py' at line 5
(trepan3k) deparse -p
instruction: 30 CALL_FUNCTION_1 1
grammar LHS: call_function
return fib(a - 1) + fib(a - 2)
       ----------
Contained in...
Grammar Symbol: binary_expr
return fib(a - 1) + fib(a - 2)
       -----------------------
In the above, we went one entry up in the call stack. The output of running the deparse command gives us:
- the instruction where we are stopped; in the example above, it is the CALL_FUNCTION_1 instruction,
- the source-code text associated with it, and
- the larger context, which is a binary_expr, here an addition.
Decompiling is not a new idea. The code base that I use in Python goes back almost 20 years, and through a long period there were a number of anonymous maintainers.
In a language like Python, there is a lot of churn from version to version and keeping the decompiler up to date is a bit of a challenge. In another section I describe how to deal with that.
The earliest authors and developers of the Python decompiler were well versed in compiler technology, but later maintainers of the code, not so much. As a result there were hacks and patches done in places that I believe would have been better done within the frameworks of the parsing technology rather than through more ad hoc methods. Later I will describe uncompyle6's pipeline in the hope that this can be used in other decompilation systems.
Like Python, the corresponding decompilation package for Perl also goes back almost 20 years. However the decompiling techniques in the decompilers for the two languages are very different. Perl5's evaluator is tree based. This makes decompilation easier: no parser is needed to build an AST, since that is already in place. Also, there has been less change in the instructions from release to release over the course of Perl's 30 or so years.
So perhaps it is not surprising that decompilation in Perl is more ad hoc than in Python.
While decompiling technology is not new in Perl or Python, what is new is using it to find the exact location at run time, say in a debugger or when in a stack trace. In order to do this we just need to keep around associations from offset to text fragments. I also keep around the parse tree for the functions that are contained in the call stack so as to give nonterminal names and the parent context for the fragment in question.
Also somewhat novel, at least for Python, is beefing up the technology and testing to track the changes in Python's bytecode over a long period (15 versions, with variations, spanning 20 years).
There are difficulties in debugging code that has been optimized. This is a problem that has been long recognized, and there is no universally accepted approach.
The difficulty arises because a really good optimizer will scramble the program so that there is no longer a one-to-one mapping between instructions and source-code constructs. Some source-code constructs can be moved around, duplicated, or eliminated. Similarly, from the instruction point of view, an instruction might correspond to several different places in the source code all at once, or an instruction might represent an aspect of execution that is implicit in the program and therefore has no corresponding text. Stack cleanup on leaving a block of code is an example of this. (We cannot associate this with the end terminator for the compound statement, for example, because Python does not have such a text construct, and a single newline can close several compound statements.)
To work on debuggers at all, we have to roll up our sleeves and get our hands plenty dirty. The additional complexity involved when there is code optimization just makes this task harder, so it is understandable that this aspect has largely been ignored.
Here is a canonical tricky problem debugging optimized code. Suppose the source code looks like:
if condition then
    # some unconditional code
    a := x / z
else
    # some more unconditional code
    b := (x / z) + 2
end
and suppose an optimizing compiler decides it is safe to rewrite this as:
t1 := x / z
if condition then
    # some unconditional code
    a := t1
else
    # some more unconditional code
    b := t1 + 2
end
Suppose we get an error in the hoisted code, specifically a divide-by-zero error in evaluating x / z.
If we try to figure out what went wrong, we are trying to assess why z is zero. We really do not care about all the stuff after the t1 assignment in the hoisted code, such as what is in the condition or the unconditional code that appears in the then and else branches. In fact, it is helpful to know that we can just forget about all of that. What we are interested in is how t1 got its value. And that is exactly what is shown in the decompiled code.
The assignment to t1 might look like a puzzle, since we did not write t1 anywhere. It would be nice if the optimizer recorded its actions somewhere. That way we would know that this code was hoisted. Realistically, there is no optimizer in widespread use that does this in a way that is accessible to a programmer when an error is encountered.
But if there were such information, a decompiler could help here, since it has a parse tree of the source code: it could feed the recreated source code or abstract syntax tree to some sort of code-comparison analyzer which also has access to the original source.
Currently, programmers debugging optimized code typically perform these steps, consciously or unconsciously: they do a mental decompilation to understand what the program was actually doing. So we have just assisted that step by turning what might be an unfamiliar language -- bytecode or machine instructions -- into a familiar programming language.
In other words, one way a computer can assist debugging optimized code is to describe what is there as simply and naturally as possible. Then come the additional steps of untangling or explaining the transformations.
I believe that if we just present the code simply, the way the computer sees it, and if we have reference to the original source code, we will be able to understand why the two are equivalent. And in some cases, we might decide to rewrite the program along the lines of the transformed program.
As an example, for the following Python code:
if x:
    return x
else:
    return f(x)
when we use uncompyle6 to decompile the bytecode produced by the CPython interpreter, equivalent source code is produced:
return x or f(x)
We may decide we like this formulation better. On the other hand, if we had started with:
JUMP_ABSOLUTE = 100
...
if op == JUMP_ABSOLUTE:
this deparses to:
if op == 100:
Here we probably want to keep our original formulation as it was intentionally more verbose for clarity, flexibility and readability.
The kinds of peephole optimizations done by the Perl interpreter can often be undone, and the Perl decompiler makes the reverse transformation when it can. In Python, some reversible peephole optimizations can be covered by additional grammar rules.
However, when an optimization simplifies code, that transformation is not reversible. An example of this in Python is where this code:
if 1:
    statement
gets simplified in bytecode to:
statement
Given the line number associated with the original source code, which is usually preserved in the bytecode, we can easily infer what transformation the Python interpreter made and why. More generally, there may be artifacts other than line numbers, such as some dead code, that a tool might use to warn of such a situation.
In contrast to this simple approach using decompilation, most of the other approaches to debugging optimized code involve changing the compilation system to include additional information. This can be an adjunct to decompilation information.
My own experience in debugging optimized code using decompilation has been positive.
The Python decompiler uncompyle6 uses this decompiling pipeline:
- disassemble
- massage disassembly
- handle control flow
- parse using a grammar (custom grammar rules may be added)
- work on AST to recreate source text and map offsets to fragments of text
This should look somewhat familiar to compiler writers; it is similar to the classic compiler pipeline:
- scan tokens
- parse using a grammar
- work on AST (e.g., type checking, instruction generation)
- register allocation, optimization, final instruction assembly
In this light, decompiling is much simpler than compiling. For example, we do not need to do type checking or register allocation.
And in decompilation it simplifies things to assume the input is well formed. So we do not even have to understand all the instructions in full gory detail, but instead just enough to recreate the source.
Below we go into more depth on each decompilation phase.
In conventional compiler technology we scan characters and build up tokens, like identifier or string. Tokens have an additional attribute like 5 for number or "rocky" for string.
In a decompiler, the token is an instruction. The token name is then the opcode name and the additional attribute is generally the operand of the instruction.
In uncompyle6 we change the opcode name for those opcodes that take a variable number of arguments. Here the number of arguments for a specific instruction, found in the operand of the instruction, is appended to the opcode name. For example, if we see a BUILD_LIST instruction with an argument of 2 (it operates on the top two expressions on the stack), instead of using a generic rule like:
list ::= expr* BUILD_LIST
we create a custom grammar rule:
list ::= expr expr BUILD_LIST_2
This speeds up parsing by reducing the number of possible grammar rules that can be applied. It also reduces the possibility of ambiguous parses. Right after disassembly we make a pass to convert such opcode names.
Why not just go ahead and add the grammar rule at the first point where we turn BUILD_LIST into BUILD_LIST_2? I will explain in the Parsing section.
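To make the tokenization step concrete, here is a minimal sketch of renaming variable-arity opcodes after disassembly. It uses Python's standard dis module; the Token type and the set of opcodes to specialize are simplifications invented for this example, not uncompyle6's actual data structures.

import dis
from collections import namedtuple

# A token is just an opcode name plus its operand ("attribute").
Token = namedtuple("Token", "name attr offset")

# Opcodes whose operand is an item count; an illustrative subset only.
VARIABLE_ARITY = {"BUILD_LIST", "BUILD_TUPLE", "BUILD_SET", "CALL_FUNCTION"}

def tokenize(code):
    tokens = []
    for inst in dis.get_instructions(code):
        name = inst.opname
        if name in VARIABLE_ARITY:
            # BUILD_LIST with an argument of 2 becomes BUILD_LIST_2, and so on.
            name = "%s_%d" % (name, inst.arg)
        tokens.append(Token(name, inst.arg, inst.offset))
    return tokens

# Usage: tokenize the code for a list display of two expressions.
for t in tokenize(compile("[a, b]", "<example>", "eval")):
    print(t.offset, t.name, t.attr)

A custom rule such as list ::= expr expr BUILD_LIST_2 would then be registered just before parsing begins.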
When there is little to no branch optimization, e.g., changing the targets of an unconditional jump instruction that jumps to other unconditional jump instructions, we can often come up with a grammar that does not need special treatment of control flow. The nesting and sequencing of compound statements is something grammars track well. Early versions of Python did not do branch optimization, so their decompilers did not have much in the way of control-flow analysis.
Over time more branch optimization took place. Any serious decompiler for Python 3.6 will need some sort of control-flow analysis.
The approach used in uncompyle6 is to add pseudo bracketing tokens. The first one introduced was COME_FROM. Later the name was refined to indicate the larger compound structure, e.g., COME_FROM_LOOP.
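As a sketch of the idea (not uncompyle6's actual scanner code), pseudo tokens can be planned by recording, for each jump instruction, the offset it jumps to; a COME_FROM token is then inserted just before each recorded target during tokenization. The set of jump opcodes below is a representative sample and varies by Python version.

import dis

JUMP_OPS = {"JUMP_FORWARD", "JUMP_ABSOLUTE", "POP_JUMP_IF_FALSE", "POP_JUMP_IF_TRUE"}

def come_from_targets(code):
    """Map each jump-target offset to the offsets of the jumps that reach it."""
    targets = {}
    for inst in dis.get_instructions(code):
        if inst.opname in JUMP_OPS:
            # For jump instructions, argval is the resolved target offset.
            targets.setdefault(inst.argval, []).append(inst.offset)
    return targets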
Local inspection of bytecode around an offset is sometimes all that is needed to distinguish two similar kinds of Python constructs. For example, to distinguish a and b from if a: b, an instruction in the if statement will push a block on the evaluation stack, but no block is pushed for an and expression.
I am currently working on full flow-control analysis via basic blocks and dominator regions.
In the Scanning section above, we added a custom parse rule as a result of seeing BUILD_LIST with an attribute of 2 (the list contains two items). We said this was done "before parsing", not "after disassembly". The two are not strictly the same. As with conventional compilers there is a tokenization phase and a parsing phase, and these are distinct with respect to the larger plan.
Since we support multiple versions of Python, we use different grammars for each version. So before adding the custom grammar rule, there is some code to pull in the right grammar for that version of Python.
In a conventional compiler, the grammar is unambiguous: we want a given source-code construct to mean one thing. In decompiling it is natural that there can be multiple correct source-code expressions for a given set of instructions. For example, if a: f() might come out as the shorter a and f().
uncompyle6 uses an Earley-algorithm parser, which can handle ambiguous grammars.
This may seem obvious, but it is worth mentioning. At the top level of the grammar, symbols for the AST used by both the compiler and the decompiler can and should be the same. For example, generally a programming language has statements, and those might be simple or compound, with the compound statements taking different forms like if statements, loop statements and so on.
For interoperability it is desirable to make the decompiler's nonterminals which generally appear at the top of a tree match the names used in the compiler's AST. Again these would be the names for statements, if statements, and so on.
Unavoidably, though, there has to be some point at which these differ since the source code and object code are different languages.
One other area of grammar-rule writing I encountered helps in handling multiple grammars for what is logically a single language with slight variations between versions. Sometimes additional rules are put in place to group one or more terminal symbols, so that at the next level up a grammar rule can be shared between two grammars.
As a specific example, the Python version 2.6 sequence JUMP_IF_TRUE POP_TOP was combined into the single opcode POP_JUMP_IF_TRUE in Python 2.7.
So rather than have this rule for the Python 2.6 grammar:
testtrue_then ::= expr JUMP_IF_TRUE POP_TOP
and this one for the Python 2.7 grammar:
testtrue_then ::= expr POP_JUMP_IF_TRUE
there is a single rule for Python 2.6, 2.7, and others:
testtrue_then ::= expr jmp_true
with corresponding rules for jmp_true, e.g., for Python 2.6:
jmp_true ::= JUMP_IF_TRUE POP_TOP
and for Python 2.7 the singleton rule:
jmp_true ::= POP_JUMP_IF_TRUE
Although this adds levels to the generated tree, it simplifies semantic actions. For example, in the Python grammars jmp_true is also used in this rule:
assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1 come_froms_pop
By using a single nonterminal here rather than one or two terminal symbols depending on grammar version, the ordinal position of LOAD_ASSERT is kept the same. The semantic actions need to pull out the attribute of that instruction, and having it at a fixed position helps. Right now, just one set of semantic-action routines covers all grammars.
To simplify the display complexities caused by adding the singleton rule, the AST display routine is jiggered to hide such singleton derivations between nonterminals.
Another AST simplification is to mark certain nonterminals as being list-oriented. This effect is deeper than visual display though. When the AST tree builder is building nodes for those kinds of nonterminals, it appends them to a list, for example:
stmts -> [stmt, stmt, stmt]
rather than creating a chain like:
stmts -> [stmts, stmt]
          |
          +--> [stmts, stmt]
                |
                +--> stmt
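A minimal sketch of this kind of list flattening during tree building; the tuple-based node representation and the set of list-oriented nonterminals here are invented for illustration.

# Nonterminals treated as flat lists rather than recursive chains.
LIST_NONTERMINALS = {"stmts", "exprs", "come_froms"}

def build_node(nonterminal, children):
    """Build a tree node, flattening list-oriented nonterminals."""
    if nonterminal in LIST_NONTERMINALS:
        flat = []
        for child in children:
            # A child that is itself the same list nonterminal gets spliced in.
            if isinstance(child, tuple) and child[0] == nonterminal:
                flat.extend(child[1])
            else:
                flat.append(child)
        return (nonterminal, flat)
    return (nonterminal, list(children))

# Usage: stmts ::= stmts stmt builds one flat list instead of a nested chain.
node = build_node("stmts", [build_node("stmts", ["stmt1", "stmt2"]), "stmt3"])
print(node)   # ('stmts', ['stmt1', 'stmt2', 'stmt3'])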
As mentioned above, we do not necessarily have to use all the pieces of the AST, just enough to recreate the source. (In that sense the tree created from instructions is not all that abstract: in contrast to a conventional AST for a high-level language, few if any of the input tokens or instructions have been discarded.)
In any compiler that maps high-level constructs (i.e., more abstract, and thus shorter, sequences) into longer, lower-level constructs, we expect that there will be places where a token or text might be duplicated. A decompiler needs to use just one of those copies.
Here is an example. Suppose we have a loop that iterates over variable i. There may be code to initialize i before entering the loop and another instruction to increment the variable i. But in the source code the variable might appear only once:
for i in range(10):
    print(i)
This particular situation does not occur in Python, but duplication occurs in other complicated ways. Specifically, function information can be obtained from either part of Python's code structure for that function or from the instructions that dynamically create the function.
A big challenge in the Python decompiler code base is maintaining it as new releases come out. Right now there is a new Python release about every year, and the code currently supports 15 or so releases. More often than not, both the programming language and bytecode change.
It is desirable to store changes as sets of differences rather than as fully independent code for a number of reasons:
- Changes in opcodes mean changes in grammar rules, and using differences makes the correspondence clearer; similarly ...
- Changes in grammar can lead to changes in semantic actions
- When a bug is found, naturally, it is found in a particular release. Tracking the differences facilitates understanding what other releases may be affected.
- Noting the differences between releases aids in writing other tools. For example, when I started writing a bytecode-to-bytecode converter totally independent of the decompiler, having the version-to-version differences listed was helpful.
The way we handle this in uncompyle6 is via object-oriented subclassing. The base class, however, is not the grammar for the first version of Python. Instead we use some middle version for each of the Python 2 and Python 3 lines and work from that.
Usually changes are small, although occasionally there are large changes. That is why, for example, the base class for Python 3 is different from Python 2.
Whenever we work with differences, we need a way not only to add things, but also to remove things. As a result, although a bit unusual, I have extended the Earley Algorithm parser so that it can remove grammar rules.
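A minimal sketch of storing per-version grammars as differences via subclassing; the class names and the add_rule/remove_rule helpers are hypothetical, not the actual parser API.

class Python27Grammar:
    """Base grammar: a middle version that other versions diff against."""
    def __init__(self):
        self.rules = set()
        self.add_rule("testtrue_then ::= expr jmp_true")
        self.add_rule("jmp_true ::= POP_JUMP_IF_TRUE")

    def add_rule(self, rule):
        self.rules.add(rule)

    def remove_rule(self, rule):
        self.rules.discard(rule)

class Python26Grammar(Python27Grammar):
    """Python 2.6 expressed as a difference from the 2.7 base."""
    def __init__(self):
        super().__init__()
        # 2.6 has no POP_JUMP_IF_TRUE opcode; it uses a two-instruction sequence.
        self.remove_rule("jmp_true ::= POP_JUMP_IF_TRUE")
        self.add_rule("jmp_true ::= JUMP_IF_TRUE POP_TOP")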
As any compiler developer knows, testing is important. It is also generally time consuming since a lot of data or programs needs to be tested and the compilation process is usually not that speedy.
We use common-practice testing techniques, some of which come from general software engineering but need a little adaptation for compilers; some techniques that are in common use in compiler development also require adaptation.
Since running a full test suite takes a long time, we run quick unit tests first. These target specific bits of code precisely, so when they fail we generally know what isolated section of code to look at and fix.
Daniel Bradburn also recently introduced property-based testing with hypothesis into the code base as another means to get broader coverage with a small amount of code.
We use three free (for open-source projects) continuous-integration services: CircleCI, Travis and AppVeyor. These help out with running massive regression tests. With Travis we can cover many different Python releases, which are run in parallel. On CircleCI we do massive testing using the full installed Python system library. AppVeyor handles testing on Microsoft Windows.
Another testing technique from software development is measuring code coverage. However in uncompyle6, recall that some of our "code" is grammar rules. So I have extended the parser system to record grammar coverage in the reduction rules. As mentioned above, rules get pulled in via subclassing, and looking at grammar coverage has been helpful in deciding which rules should go where, which rules to remove, and which additional tests are needed.
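A sketch of what recording grammar coverage might look like; the on_reduce() callback is a hypothetical hook standing in for whatever the parser calls when it applies a reduction.

from collections import Counter

rule_coverage = Counter()

def on_reduce(rule):
    """Called (hypothetically) each time the parser reduces by a grammar rule."""
    rule_coverage[rule] += 1

def coverage_report(grammar_rules):
    # Rules with zero counts are candidates for removal, or a sign
    # that more test programs are needed.
    for rule in sorted(grammar_rules):
        print(rule_coverage[rule], rule)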
A common testing technique used especially in compiler testing is "round tripping," or performing some series of transformations which should lead back exactly to the starting point. Applied here, this would mean compiling source, decompiling that, and then checking that the source code is the same with some fuzzing with respect to inessential differences such as indentation or comments.
For this to work, however, we would need to make sure that the formatting of the source code we produce exactly matches the original. Since this requires additional work, we do not do this. A better variation is available in later versions of Python: go one step further and compare ASTs generated from the source code. This removes the inessential differences.
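A sketch of the AST-comparison variant of round-trip testing; decompile_to_source() is a hypothetical stand-in for the decompiler's entry point.

import ast

def asts_match(original_source, decompiled_source):
    """Compare the two programs structurally, ignoring formatting differences."""
    return ast.dump(ast.parse(original_source)) == ast.dump(ast.parse(decompiled_source))

original = "x_or_fx = lambda x, f: x or f(x)"
# decompiled = decompile_to_source(compile(original, "<test>", "exec"))  # hypothetical
decompiled = "x_or_fx = lambda x, f: (x or f(x))"   # same program, different formatting
assert asts_match(original, decompiled)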
A similar procedure would be to start out with bytecode rather than the source code, decompile that, and then compile again. The resulting bytecode should match the original bytecode with some fuzzing.
However this, too, is not without problems. Although it is rare, Python does optimize code. (An example of this was given earlier.) So the source reproduced will represent optimized code. We could carry this one step further and do it again.
For a limited set of programs that we know will round trip, we do this kind of test starting with bytecode. But note that doing only two-thirds of the round trip will still find errors: decompile the bytecode, then compile the resulting source code. In that last step, compiling the source code, we often find errors because the source code we produced is not valid.
In Python, as in other large systems, there already is a test suite for checking the distributed library code. Since those programs are self-checking, they too can be decompiled and run.
I introduce the idea of using a decompiler to show where a program is at any given time. Decompilers are useful in run-time error analysis, stack traces and debuggers, and also in debugging optimized code. The design of a decompiler for Python is much like that of a conventional compiler, possibly much simpler, and many of the issues other compiler writers have encountered can be dealt with in a familiar fashion (see the Parsing and AST sections). Finally, I discuss the day-to-day issues of testing. Again, this will be familiar to those who have worked with compilers.
I hope that this paper serves to demystify decompilers and encourage their wider use in other languages, so that in the future there is less ambiguity in error reporting and more assistance in debugging optimized code. There is still much to do.