Deparsing Paper
This was the original draft for https://github.com/rocky/python-uncompyle6/wiki/Deparsing-Paper.pdf
Decompiling is often used in conjunction with recovering lost source-code, or reverse engineering code when you don't have access to the source code.
Here I describe another use: in tracebacks, debuggers, core dumps and similar places where accurate position-reporting is desired. I show how more conventional methods of source-code reporting are vague and ambiguous.
Decompilation is the most accurate way to describe where the program currently is, in high-level terms familiar to the programmer. It is not in widespread use because it is unfamiliar and labor-intensive. This paper shows how to ameliorate these problems by giving a pattern for writing decompilers. I hope to encourage the incorporation of decompilation into tracebacks, debuggers, core dumps, and so on.
I also show how decompilation advances the art of debugging optimized code.
Although effecting a pervasive change in a compiler is arduous and error-prone, the decompiler can be developed somewhat independently.
Next, I describe a pipeline for a Python decompiler, and show similarities with a conventional programming-language compiler.
Finally, I describe issues around developing and maintaining a Python bytecode decompiler, which handles input spanning 15 releases of the language over some 20 years.
One of my pet peeves is the vagueness of reporting a line number as a position in a program. Consider the following Python code:
prev = [100] + range(3)
def boom(p):
    x = prev[prev[p]]
    ...
If we run this in Python, we get:
IndexError: list index out of range
But which index caused the problem?
Or how about errors in these lines:
[x[0] for i in d[j] if got[i] == e[i]] # which index?
x = a / b / c # which divide?
return fib(x) + fib(y) # which fib?
And then there are the problems with functions like eval() which evaluate either a string containing a programming-language fragment, or some intermediate representation of a program such as an Abstract Syntax Tree or a code object. In these cases, the program is created at run time, either from strings or from method calls which create AST objects. Either way, there is no file containing source code as a stream of characters.
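As a concrete illustration, here is a minimal sketch using Python's built-in compile() and eval(); the filename "<generated>" is something we invent, and there is no file of source text for a traceback to point into.

# Build a program at run time; no file on disk contains this source.
source = "prev[prev[0]]"
code = compile(source, "<generated>", "eval")
prev = [2, 0, 1]
print(eval(code))   # prints 1; any error here could only point at "<generated>"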
The Pos type in Go's AST is more informative in that it encodes the file name, line number, and column number all very compactly as a single integer. Even better, though, would be to extend this single position to a pair of Pos values marking an interval. Go's implementation cleverly takes advantage of the property that the number of interesting locations in a program, whether a single point or a pair of points, is much smaller than the program's total number of bytes.
Years of experience have shown that trying to get compilers and interpreters to track location correspondences through the compilation process and then save them in the run-time system is an uphill battle. When I suggested using interval positions in Go, the reaction was that the line/column approach was already slowing down compilation.
And adding a column position to a compiler that lacks it is, in truth, a big disruptive change and likely to introduce errors. The Clang compiler of the LLVM project, an alternative to the GNU Compiler Collection (GCC), did this. However it took years before a corresponding change was added to GCC as well. In both cases, column position is tracked only up to the parsing phase and not at run time, where it is also useful.
Even with line and column number, you have problems. Consider:
fn().a
^
Is the problem in accessing fn before the call, or accessing the result after?
In my experience with Go, this feature sparked a discussion of how to disambiguate the two and where the canonical location should be.
Distinguishing before from after a call could be done by putting the column indicator at the beginning of the identifier, at the "f". If the column position were instead the position of the last letter, the "n", it would indicate after the call. But the position is ambiguous if the identifier is a single letter. If we shift the column one place before or after the identifier, there might be ambiguity as to which token is indicated, because another token like + might be there instead of a blank space. Can we use the opening and closing parentheses instead? That works if the language requires parentheses (so it will not work in Ruby, where parentheses are optional). But then we consider:
fn.a
and have the same issue: where to put the column position for attribute accessors? And is our somewhat arbitrary decision consistent with our earlier decision? And so on.
Even with all this hand wringing, the place ultimately chosen might cause confusion. For example in the standard CPython implementation, the line number reported in a multi-line string happens to be the line number at the end of the string, rather than at the beginning. I thought this confusing the first time I encountered it. In the PyPy Python implementation it is at the beginning.
Basically the problem is that a line number and a column number have too little information to fully describe a position in a program.
A much cleaner and more accurate solution, and one requiring less deliberation, is to supply an interval. Here the examples above become:
fn().a
--
to indicate before the call (a reference to just fn), and:
fn().a
----
for after the call. See below for how to disambiguate in a language like Ruby where parentheses are optional.
And in the second situation involving object accessors:
fn.a
--
vs.
fn.a
  --
There is much less ambiguity, and the little that remains is covered by supplying a grammar nonterminal name or instruction information at the stopping point.
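One simple way to represent such a position is as an interval together with the grammar construct it belongs to. The Fragment class below is a hypothetical illustration of that idea, not the decompiler's actual data structure.

from dataclasses import dataclass

@dataclass
class Fragment:
    start: int        # offset of the first character of the fragment
    end: int          # offset just past the last character
    nonterminal: str  # e.g. "call_function" or "binary_expr"

    def underline(self, line):
        """Render the source line with the fragment marked by dashes."""
        return line + "\n" + " " * self.start + "-" * (self.end - self.start)

# Marks "fn()" -- the interval for after the call -- in the example above.
print(Fragment(0, 4, "call_function").underline("fn().a"))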
For the examples given at the beginning of this section, the uncompyle6 decompiler gives:
x = prev[prev[0]]
        --------
x = prev[prev[0]]
    -------------
[x[0] for i in d[j] if got[i] == e[i]]
 ----
[x[0] for i in d[j] if got[i] == e[i]]
               ----
[x[0] for i in d[j] if got[i] == e[i]]
                       ------
[x[0] for i in d[j] if got[i] == e[i]]
                                 ----
x = a / b / c
      -
x = a / b / c
          -
In sum, the solution given here eliminates the classes of problems described above:
- ambiguity of location,
- source does not exist as text, and
- invasive changes to compiler and run-time environment
As to the last point, all of this is done without modifying the compilation process. We do have to hook into the run-time environment, but only at the point of an exception which is usually easy. Often the language has in place a callback-hook mechanism for handling run-time exceptions.
Since this is done as a separate isolated section of code or perhaps an external module, we can give a very precise location without slowing down compilation or bloating the run-time system. Even more, the position is not just a range of text in the source code, but it also can be associated with grammar constructs.
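For example, in Python the run-time hook can be as small as the sketch below. sys.excepthook is a standard hook; deparse_around_offset() is a hypothetical stand-in for a decompiler routine that maps a code object and bytecode offset to a source fragment.

import sys

def deparsing_excepthook(exc_type, exc_value, tb):
    # Walk to the innermost frame, where the error actually occurred.
    while tb.tb_next is not None:
        tb = tb.tb_next
    code = tb.tb_frame.f_code
    offset = tb.tb_lasti  # bytecode offset of the failing instruction
    # Hypothetical: ask the decompiler for the text fragment at this offset.
    # fragment = deparse_around_offset(code, offset)
    print("error at bytecode offset %d in %s" % (offset, code.co_name))
    sys.__excepthook__(exc_type, exc_value, tb)

sys.excepthook = deparsing_excepthook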
I close this section with another example, using my Python 3 debugger which uses uncompyle6's decompilation routines. This is the interaction of the debugger in debugging a simple-minded implementation of the Fibonacci function:
(/tmp/fib.py:1): fib
-> 1 def fib(a):
(trepan3k) up
(/tmp/fib.py:3 @30): fib
(trepan3k) backtrace
##0 fib(a=0) called from file '/tmp/fib.py' at line 1
->1 fib(a=2) called from file '/tmp/fib.py' at line 3
##2 <module> exec() '/tmp/fib.py' at line 5
(trepan3k) deparse -p
instruction: 30 CALL_FUNCTION_1 1
grammar LHS: call_function
return fib(a - 1) + fib(a - 2)
       ----------
Contained in...
Grammar Symbol: binary_expr
return fib(a - 1) + fib(a - 2)
       -----------------------
In the above, we went one entry up in the call stack. The output of running the deparse command gives us:
- the instruction where we are stopped; in the example above, it is the CALL_FUNCTION_1 instruction,
- the source-code text associated with it, and
- the larger context, which is a binary_expr, here an addition.
Decompiling is not a new idea. The code base that I use in Python goes back almost 20 years, and through a long period there were a number of anonymous maintainers.
In a language like Python, there is a lot of churn from version to version and keeping the decompiler up to date is a bit of a challenge. In another section I describe how to deal with that.
The earliest authors and developers of the Python decompiler were well versed in compiler technology, but later maintainers of the code, not so much. As a result there were hacks and patches done in places that I believe would have been better done within the frameworks of the parsing technology rather than through more ad hoc methods. Later I will describe uncompyle6's pipeline in the hope that this can be used in other decompilation systems.
Like Python, the corresponding decompilation package for Perl also goes back almost 20 years. However the decompiling techniques in the decompilers for the two languages are very different. Perl5's evaluator is tree based. This makes decompilation easier: no parser is needed to build an AST, since that is already in place. Also, there has been less change in the instructions from release to release over the course of Perl's 30 or so years.
So perhaps it is not surprising that decompilation in Perl is more ad hoc than in Python.
While decompiling technology is not new in Perl or Python, what is new is using it to find the exact location at run time, say in a debugger or when in a stack trace. In order to do this we just need to keep around associations from offset to text fragments. I also keep around the parse tree for the functions that are contained in the call stack so as to give nonterminal names and the parent context for the fragment in question.
Also somewhat novel, at least for Python, is beefing up the technology and testing to track the changes in Python's bytecode over a long period (15 versions, with variations, spanning 20 years).
There are difficulties in debugging code that has been optimized. This is a problem that has been long recognized, and there is no universally accepted approach.
The difficulty arises because a really good optimizer will scramble the program so that there is no longer a one-to-one mapping between instructions and source-code constructs. Some source-code constructs can be moved around, duplicated, or eliminated. Similarly, from the instruction point of view, an instruction might correspond to several different places in the source code all at once, or an instruction might represent an aspect of execution that is implicit in the program and therefore has no corresponding text. Stack cleanup on leaving a block of code is an example of this. (We cannot associate this with the end terminator for the compound statement, for example, because Python does not have such a text construct, and a single newline can close several compound statements.)
To work on debuggers at all, we have to roll up our sleeves and get our hands plenty dirty. The additional complexity involved when there is code optimization just makes this task harder, so it is understandable that this aspect has largely been ignored.
Here is a canonical tricky problem debugging optimized code. Suppose the source code looks like:
if condition then
    # some unconditional code
    a := x / z
else
    # some more unconditional code
    b := (x / z) + 2
end
and suppose an optimizing compiler decides it is safe to rewrite this as:
t1 := x / z
if condition then
    # some unconditional code
    a := t1
else
    # some more unconditional code
    b := t1 + 2
end
Suppose we get an error in the hoisted code, specifically a divide-by-zero error in evaluating x / z.
If we try to figure out what went wrong, we are trying to assess why z is zero. We really do not care about all the stuff after the t1 assignment in the hoisted code, such as what is in the condition or the unconditional code that appears in the then and else branches. In fact, it is helpful to know that we can just forget about all of that. What we are interested in is how t1 got its value. And that is exactly what is shown in the decompiled code.
The assignment to t1 might look like a puzzle, since we did not write t1 anywhere. It would be nice if the optimizer recorded its actions somewhere. That way we would know that this code was hoisted. Realistically, there is no optimizer in widespread use that does this in a way that is accessible to a programmer when an error is encountered.
But if there were such information, a decompiler could help here, since it has a parse tree of the source code: it could feed the recreated source code or abstract syntax tree to some sort of code-comparison analyzer which also has access to the original source.
Currently, programmers debugging optimized code typically perform these steps, consciously or unconsciously: they do a mental decompilation to understand what the program was actually doing. So we have just assisted that step by turning what might be an unfamiliar language -- bytecode or machine instructions -- into a familiar programming language.
In other words, one way a computer can assist debugging optimized code is to describe what is there as simply and naturally as possible. Then come the additional steps of untangling or explaining the transformations.
I believe that if we just present the code simply, the way the computer sees it, and if we have reference to the original source code, we will be able to understand why the two are equivalent. And in some cases, we might decide to rewrite the program along the lines of the transformed program.
As an example, for the following Python code:
if x:
    return x
else:
    return f(x)
when we use uncompyle6 to decompile the bytecode produced by the CPython interpreter, equivalent source code is produced:
return x or f(x)
We may decide we like this formulation better. On the other hand, if we had started with:
JUMP_ABSOLUTE = 100
...
if op == JUMP_ABSOLUTE:
this deparses to:
if op == 100:
Here we probably want to keep our original formulation as it was intentionally more verbose for clarity, flexibility and readability.
The kinds of peephole optimizations done by the Perl interpreter can often be undone, and the Perl decompiler makes the reverse transformation when it can. In Python, some reversible peephole optimizations can be covered by additional grammar rules.
However, when an optimization simplifies code, that transformation is not reversible. An example of this in Python is where this code:
if 1:
    statement
gets simplified in bytecode to:
statement
Given the line number associated with the original source code, which is usually preserved in the bytecode, we can easily infer what transformation the Python interpreter made and why. More generally, there may be artifacts other than line numbers, such as some dead code, that a tool might use to warn of such a situation.
In contrast to this simple approach using decompilation, most of the other approaches to debugging optimized code involve changing the compilation system to include additional information. This can be an adjunct to decompilation information.
My own experience in debugging optimized code using decompilation has been positive.
The Python decompiler uncompyle6 uses this decompiling pipeline:
- disassemble
- massage disassembly
- handle control flow
- parse using a grammar (custom grammar rules may be added)
- work on AST to recreate source text and map offsets to fragments of text
This should look somewhat familiar to compiler writers; it is similar to the classic compiler pipeline:
- scan tokens
- parse using a grammar
- work on AST (e.g., type checking, instruction generation)
- register allocation, optimization, final instruction assembly
In this light, decompiling is much simpler than compiling. For example, we do not need to do type checking or register allocation.
And in decompilation it simplifies things to assume the input is well formed. So we do not even have to understand all the instructions in full gory detail, but instead just enough to recreate the source.
Below we go into more depth on each decompilation phase.
In conventional compiler technology we scan characters and build up tokens, like identifier or string. Tokens have an additional attribute like 5 for number or "rocky" for string.
In a decompiler, the token is an instruction. The token name is then the opcode name and the additional attribute is generally the operand of the instruction.
In uncompyle6 we change the opcode name for those opcodes that take a variable number of arguments. Here the number of arguments for a specific instruction, found in the operand of the instruction, is appended to the opcode name. For example, if we see a BUILD_LIST instruction with an argument of 2 (it operates on the top two expressions on the stack), instead of using a generic rule like:
list ::= expr* BUILD_LIST
we create a custom grammar rule:
list ::= expr expr BUILD_LIST_2
This speeds up parsing by reducing the number of possible grammar rules that can be applied. It also reduces the possibility of ambiguous parses. Right after disassembly we make a pass to convert such opcode names.
Why not just go ahead and add the grammar rule at the first point where we turn BUILD_LIST into BUILD_LIST_2? I will explain in the Parsing section.
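To make the tokenization step concrete, here is a minimal sketch of renaming variable-arity opcodes after disassembly. It uses Python's standard dis module; the Token type and the set of opcodes to specialize are simplifications invented for this example, not uncompyle6's actual data structures.

import dis
from collections import namedtuple

# A token is just an opcode name plus its operand ("attribute").
Token = namedtuple("Token", "name attr offset")

# Opcodes whose operand is an item count; an illustrative subset only.
VARIABLE_ARITY = {"BUILD_LIST", "BUILD_TUPLE", "BUILD_SET", "CALL_FUNCTION"}

def tokenize(code):
    tokens = []
    for inst in dis.get_instructions(code):
        name = inst.opname
        if name in VARIABLE_ARITY:
            # BUILD_LIST with an argument of 2 becomes BUILD_LIST_2, and so on.
            name = "%s_%d" % (name, inst.arg)
        tokens.append(Token(name, inst.arg, inst.offset))
    return tokens

# Usage: tokenize the code for a list display of two expressions.
for t in tokenize(compile("[a, b]", "<example>", "eval")):
    print(t.offset, t.name, t.attr)

A custom rule such as list ::= expr expr BUILD_LIST_2 would then be registered just before parsing begins.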
When there is little to no branch optimization, e.g., changing the targets of an unconditional jump instruction that jumps to other unconditional jump instructions, we can often come up with a grammar that does not need special treatment of control flow. The nesting and sequencing of compound statements is something grammars track well. Early versions of Python did not do branch optimization, so their decompilers did not have much in the way of control-flow analysis.
Over time more branch optimization took place. Any serious decompiler for Python 3.6 will need some sort of control-flow analysis.
The approach used in uncompyle6 is to add pseudo bracketing tokens. The first one introduced was COME_FROM. Later the name was refined to indicate the larger compound structure, e.g., COME_FROM_LOOP.
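As a sketch of the idea (not uncompyle6's actual scanner code), pseudo tokens can be planned by recording, for each jump instruction, the offset it jumps to; a COME_FROM token is then inserted just before each recorded target during tokenization. The set of jump opcodes below is a representative sample and varies by Python version.

import dis

JUMP_OPS = {"JUMP_FORWARD", "JUMP_ABSOLUTE", "POP_JUMP_IF_FALSE", "POP_JUMP_IF_TRUE"}

def come_from_targets(code):
    """Map each jump-target offset to the offsets of the jumps that reach it."""
    targets = {}
    for inst in dis.get_instructions(code):
        if inst.opname in JUMP_OPS:
            # For jump instructions, argval is the resolved target offset.
            targets.setdefault(inst.argval, []).append(inst.offset)
    return targets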
Local inspection of bytecode around an offset is sometimes all that is needed to distinguish two similar kinds of Python constructs. For example, to distinguish a and b from if a: b, an instruction in the if statement will push a block on the evaluation stack, but no block is pushed for an and expression.
I am currently working on full flow-control analysis via basic blocks and dominator regions.
In the Scanning section above, we added a custom parse rule as a result of seeing BUILD_LIST with an attribute of 2 (the list contains two items). We said this was done "before parsing", not "after disassembly". The two are not strictly the same. As with conventional compilers there is a tokenization phase and a parsing phase, and these are distinct with respect to the larger plan.
Since we support multiple versions of Python, we use different grammars for each version. So before adding the custom grammar rule, there is some code to pull in the right grammar for that version of Python.
In a conventional compiler, the grammar is unambiguous: we want a given source-code construct to mean one thing. In decompiling it is natural that there can be multiple correct source-code expressions for a given set of instructions. For example, if a: f() might come out as the shorter a and f().
uncompyle6 uses an Earley-algorithm parser, which can handle ambiguous grammars.
This may seem obvious, but it is worth mentioning. At the top level of the grammar, symbols for the AST used by both the compiler and the decompiler can and should be the same. For example, generally a programming language has statements, and those might be simple or compound, with the compound statements taking different forms like if statements, loop statements and so on.
For interoperability it is desirable to make the decompiler's nonterminals which generally appear at the top of a tree match the names used in the compiler's AST. Again these would be the names for statements, if statements, and so on.
Unavoidably, though, there has to be some point at which these differ since the source code and object code are different languages.
One other area of grammar-rule writing I encountered helps in handling multiple grammars for what is logically a single language with slight variations between versions. Sometimes additional rules are put in place to group one or more terminal symbols, so that at the next level up a grammar rule can be shared between two grammars.
As a specific example, the Python version 2.6 sequence JUMP_IF_TRUE POP_TOP was combined into the single opcode POP_JUMP_IF_TRUE in Python 2.7.
So rather than have this rule for the Python 2.6 grammar:
testtrue_then ::= expr JUMP_IF_TRUE POP_TOP
and this one for the Python 2.7 grammar:
testtrue_then ::= expr POP_JUMP_IF_TRUE
there is a single rule for Python 2.6, 2.7, and others:
testtrue_then ::= expr jmp_true
with corresponding rules for jmp_true, e.g., for Python 2.6:
jmp_true ::= JUMP_IF_TRUE POP_TOP
and for Python 2.7 the singleton rule:
jmp_true ::= POP_JUMP_IF_TRUE
Although this adds levels to the generated tree, it simplifies semantic actions. For example, in the Python grammars jmp_true is also used in this rule:
assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1 come_froms_pop
By using a single nonterminal here rather than one or two terminal symbols depending on grammar version, the ordinal position of LOAD_ASSERT is kept the same. The semantic actions need to pull out the attribute of that instruction, and having it at a fixed position helps. Right now, just one set of semantic-action routines covers all grammars.
To simplify the display complexities caused by adding the singleton rule, the AST display routine is jiggered to hide such singleton derivations between nonterminals.
Another AST simplification is to mark certain nonterminals as being list-oriented. This effect is deeper than visual display though. When the AST tree builder is building nodes for those kinds of nonterminals, it appends them to a list, for example:
stmts -> [stmt, stmt, stmt]
rather than creating a chain like:
stmts -> [stmts, stmt]
          |
          +--> [stmts, stmt]
                |
                +--> stmt
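A minimal sketch of this kind of list flattening during tree building; the tuple-based node representation and the set of list-oriented nonterminals here are invented for illustration.

# Nonterminals treated as flat lists rather than recursive chains.
LIST_NONTERMINALS = {"stmts", "exprs", "come_froms"}

def build_node(nonterminal, children):
    """Build a tree node, flattening list-oriented nonterminals."""
    if nonterminal in LIST_NONTERMINALS:
        flat = []
        for child in children:
            # A child that is itself the same list nonterminal gets spliced in.
            if isinstance(child, tuple) and child[0] == nonterminal:
                flat.extend(child[1])
            else:
                flat.append(child)
        return (nonterminal, flat)
    return (nonterminal, list(children))

# Usage: stmts ::= stmts stmt builds one flat list instead of a nested chain.
node = build_node("stmts", [build_node("stmts", ["stmt1", "stmt2"]), "stmt3"])
print(node)   # ('stmts', ['stmt1', 'stmt2', 'stmt3'])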
As mentioned above, we do not necessarily have to use all the pieces of the AST, just enough to recreate the source. (In that sense the tree created from instructions is not all that abstract: in contrast to a conventional AST for a high-level language, few if any of the input tokens or instructions have been discarded.)
In any compiler that maps high-level constructs (i.e., more abstract, and thus shorter, sequences) into longer, lower-level constructs, we expect that there will be places where a token or text might be duplicated. A decompiler needs to use just one of those copies.
Here is an example. Suppose we have a loop that iterates over variable i. There may be code to initialize i before entering the loop and another instruction to increment the variable i. But in the source code the variable might appear only once:
for i in range(10):
    print(i)
This particular situation does not occur in Python, but duplication occurs in other complicated ways. Specifically, function information can be obtained from either part of Python's code structure for that function or from the instructions that dynamically create the function.
A big challenge in the Python decompiler code base is maintaining it as new releases come out. Right now there is a new Python release about every year, and the code currently supports 15 or so releases. More often than not, both the programming language and bytecode change.
It is desirable to store changes as sets of differences rather than as fully independent code for a number of reasons:
- Changes in opcodes mean changes in grammar rules, and using differences makes the correspondence clearer; similarly ...
- Changes in grammar can lead to changes in semantic actions
- When a bug is found, naturally, it is found in a particular release. Tracking the differences facilitates understanding what other releases may be affected.
- Noting the differences between releases aids in writing other tools. For example, when I started writing a bytecode-to-bytecode converter totally independent of the decompiler, having the version-to-version differences listed was helpful.
The way we handle this in uncompyle6 is via object-oriented subclassing. The base class, however, is not the grammar for the first version of Python. Instead we use some middle version for each of the Python 2 and Python 3 lines and work from that.
Usually changes are small, although occasionally there are large changes. That is why, for example, the base class for Python 3 is different from Python 2.
Whenever we work with differences, we need a way not only to add things, but also to remove things. As a result, although a bit unusual, I have extended the Earley Algorithm parser so that it can remove grammar rules.
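A minimal sketch of storing per-version grammars as differences via subclassing; the class names and the add_rule/remove_rule helpers are hypothetical, not the actual parser API.

class Python27Grammar:
    """Base grammar: a middle version that other versions diff against."""
    def __init__(self):
        self.rules = set()
        self.add_rule("testtrue_then ::= expr jmp_true")
        self.add_rule("jmp_true ::= POP_JUMP_IF_TRUE")

    def add_rule(self, rule):
        self.rules.add(rule)

    def remove_rule(self, rule):
        self.rules.discard(rule)

class Python26Grammar(Python27Grammar):
    """Python 2.6 expressed as a difference from the 2.7 base."""
    def __init__(self):
        super().__init__()
        # 2.6 has no POP_JUMP_IF_TRUE opcode; it uses a two-instruction sequence.
        self.remove_rule("jmp_true ::= POP_JUMP_IF_TRUE")
        self.add_rule("jmp_true ::= JUMP_IF_TRUE POP_TOP")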
As any compiler developer knows, testing is important. It is also generally time consuming since a lot of data or programs needs to be tested and the compilation process is usually not that speedy.
We use common-practice testing techniques, some of which come from general software engineering but need a little adaptation for compilers; some techniques that are in common use in compiler development also require adaptation.
Since running a full test suite takes a long time, we run quick unit tests first. These target specific bits of code precisely, so when they fail we generally know what isolated section of code to look at and fix.
Daniel Bradburn also recently introduced property-based testing with hypothesis into the code base as another means to get broader coverage with a small amount of code.
We use three free (for open-source projects) continuous-integration services: CircleCI, Travis and AppVeyor. These help out with running massive regression tests. With Travis we can cover many different Python releases, which are run in parallel. On CircleCI we do massive testing using the full installed Python system library. AppVeyor handles testing on Microsoft Windows.
Another testing technique from software development is measuring code coverage. However in uncompyle6, recall that some of our "code" is grammar rules. So I have extended the parser system to record grammar coverage in the reduction rules. As mentioned above, rules get pulled in via subclassing, and looking at grammar coverage has been helpful in deciding which rules should go where, which rules to remove, and which additional tests are needed.
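A sketch of what recording grammar coverage might look like; the on_reduce() callback is a hypothetical hook standing in for whatever the parser calls when it applies a reduction.

from collections import Counter

rule_coverage = Counter()

def on_reduce(rule):
    """Called (hypothetically) each time the parser reduces by a grammar rule."""
    rule_coverage[rule] += 1

def coverage_report(grammar_rules):
    # Rules with zero counts are candidates for removal, or a sign
    # that more test programs are needed.
    for rule in sorted(grammar_rules):
        print(rule_coverage[rule], rule)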
A common testing technique used especially in compiler testing is "round tripping," or performing some series of transformations which should lead back exactly to the starting point. Applied here, this would mean compiling source, decompiling that, and then checking that the source code is the same with some fuzzing with respect to inessential differences such as indentation or comments.
For this to work, however, we would need to make sure that the formatting of the source code we produce exactly matches the original. Since this requires additional work, we do not do this. A better variation is available in later versions of Python: go one step further and compare ASTs generated from the source code. This removes the inessential differences.
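A sketch of the AST-comparison variant of round-trip testing; decompile_to_source() is a hypothetical stand-in for the decompiler's entry point.

import ast

def asts_match(original_source, decompiled_source):
    """Compare the two programs structurally, ignoring formatting differences."""
    return ast.dump(ast.parse(original_source)) == ast.dump(ast.parse(decompiled_source))

original = "x_or_fx = lambda x, f: x or f(x)"
# decompiled = decompile_to_source(compile(original, "<test>", "exec"))  # hypothetical
decompiled = "x_or_fx = lambda x, f: (x or f(x))"   # same program, different formatting
assert asts_match(original, decompiled)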
A similar procedure would be to start out with bytecode rather than the source code, decompile that, and then compile again. The resulting bytecode should match the original bytecode with some fuzzing.
However this, too, is not without problems. Although it is rare, Python does optimize code. (An example of this was given earlier.) So the source reproduced will represent optimized code. We could carry this one step further and do it again.
For a limited set of programs that we know will round trip, we do this kind of test starting with bytecode. But note that doing only two-thirds of the round trip will still find errors: decompile the bytecode, then compile the resulting source code. In that last step, compiling the source code, we often find errors because the source code we produced is not valid.
In Python, as in other large systems, there already is a test suite for checking the distributed library code. Since those programs are self-checking, they too can be decompiled and run.
I introduce the idea of using a decompiler to show where a program is at any given time. Decompilers are useful in run-time error analysis, stack traces and debuggers, and also in debugging optimized code. The design of a decompiler for Python is much like that of a conventional compiler, possibly much simpler, and many of the issues other compiler writers have encountered can be dealt with in a familiar fashion (see the Parsing and AST sections). Finally, I discuss the day-to-day issues of testing. Again, this will be familiar to those who have worked with compilers.
I hope that this paper serves to demystify decompilers and encourage their wider use in other languages, so that in the future there is less ambiguity in error reporting and more assistance in debugging optimized code. There is still much to do.