Disassembler
A disassembler is a computer program that translates machine language into assembly language—the inverse operation to that of an assembler. A disassembler differs from a decompiler, which targets a high-level language rather than an assembly language. Disassembly, the output of a disassembler, is often formatted for human-readability rather than suitability for input to an assembler, making it principally a reverse-engineering tool.
Assembly language source code generally permits the use of constants and programmer comments. These are usually removed from the assembled machine code by the assembler. If so, a disassembler operating on the machine code would produce disassembly lacking these constants and comments; the disassembled output becomes more difficult for a human to interpret than the original annotated source code. Some disassemblers make use of the symbolic debugging information present in object files such as ELF. The Interactive Disassembler allow the human user to make up mnemonic symbols for values or regions of code in an interactive session: human insight applied to the disassembly process often parallels human creativity in the code writing process.
Disassembly is not an exact science: on CISC platforms with variable-width instructions, or in the presence of self-modifying code, it is possible for a single program to have two or more reasonable disassemblies. Determining which instructions would actually be encountered during a run of the program reduces to the proven-unsolvable halting problem.
Writing a disassembler which produces code which, when assembled, produces exactly the original binary is possible; however, there are often differences. This poses demands on the expressivity of the assembler. For example, an x86 assembler takes an arbitrary choice between two binary codes for something as simple as "MOV AX,BX". If the original code uses the other choice, the original code simply cannot be reproduced at any given point in time. However, even when a fully correct disassembly is produced, problems remain if the program requires modification. For example, the same machine language jump instruction can be generated by assembly code to jump to a specified location (for example, to execute specific code), or to jump to a specified number of bytes (for example, to skip over an unwanted branch). A disassembler cannot know what is intended, and may use either syntax, generating a disassembly which reproduces the original binary[citation needed]. However, if a programmer wants to add instructions between the jump instruction and its destination, it is necessary to understand the program's operation to determine whether the jump should be absolute or relative, i.e., whether its destination should remain at a fixed location, or be moved so as to skip both the original and added instructions.
Examples of disassemblers
A disassembler may be stand-alone or interactive. A stand-alone disassembler, when executed, generates an assembly language file which can be examined; an interactive one shows the effect of any change the user makes immediately. For example, the disassembler may initially not know that a section of the program is actually code, and treat it as data; if the user specifies that it is code, the resulting disassembled code is shown immediately, allowing the user to examine it and take further action during the same run.
Any interactive debugger will include some way of viewing the disassembly of the program being debugged. Often, the same disassembly tool will be packaged as a standalone disassembler distributed along with the debugger. For example, objdump, part of GNU Binutils, is related to the interactive debugger gdb.
IDA
OllyDbg is a 32-bit assembler level analysing debugger
OLIVER and SIMON include disassemblers for Assembler, COBOL, and PL/1
Decompiler
A decompiler is a computer program that performs the reverse operation to that of a compiler. That is, it translates program code at a relatively low level of abstraction (usually designed to be computer readable rather than human readable) into a form having a higher level of abstraction (usually designed to be human readable). Decompilers usually do not perfectly reconstruct the original source code, and can vary widely in the intelligibility of their outputs. Nonetheless, decompilers remain an important tool in software reverse engineering.
The term decompiler is most commonly applied to a program which translates executable programs (the output from a compiler) into source code in a (relatively) high level language which, when compiled, will produce an executable whose behavior is the same as the original executable program. By comparison, a disassembler translates an executable program into assembly language (and an assembler could be used to assemble it back into an executable program).
Decompilation is the act of using a decompiler, although the term can also refer to the output of a decompiler. It can be used for the recovery of lost source code, and is also useful in some cases for computer security, interoperability and error correction.[1][unreliable source] The success of decompilation depends on the amount of information present in the code being decompiled and the sophistication of the analysis performed on it. The bytecode formats used by many virtual machines (such as the Java Virtual Machine or the .NET Framework Common Language Runtime) often include extensive metadata and high-level features that make decompilation quite feasible. The presence of debug data can make it possible to reproduce the original variable and structure names and even the line numbers. Machine language without such metadata or debug data is much harder to decompile.[2]
Some compilers and post-compilation tools produce obfuscated code (that is, they attempt to produce output that is very difficult to decompile). This is done to make it more difficult to reverse engineer the executable.
Design
Decompilers can be thought of as composed of a series of phases each of which contributes specific aspects of the overall decompilation process.
Loader
The first decompilation phase loads and parses the input machine code or intermediate language program's binary file format. It should be able to discover basic facts about the input program, such as the architecture (Pentium, PowerPC, etc.) and the entry point. In many cases, it should be able to find the equivalent of the main function of a C program, which is the start of the user written code. This excludes the runtime initialization code, which should not be decompiled if possible. If available the symbol tables and debug data are also loaded. The front end may be able to identify the libraries used even if they are linked with the code, this will provide library interfaces. If it can determine the compiler or compilers used it may provide useful information in identifying code idioms.
Disassembly
The next logical phase is the disassembly of machine code instructions into a machine independent intermediate representation (IR). For example, the Pentium machine instruction
mov eax, [ebx+0x04]
might be translated to the IR
eax := m[ebx+4];
Idioms
Idiomatic machine code sequences are sequences of code whose combined semantics is not immediately apparent from the instructions' individual semantics. Either as part of the disassembly phase, or as part of later analyses, these idiomatic sequences need to be translated into known equivalent IR. For example, the x86 assembly code:
cdq eax ; edx is set to the sign-extension of eax
xor eax, edx
sub eax, edx
could be translated to
eax := abs(eax);
Some idiomatic sequences are machine independent; some involve only one instruction. For example, xor eax, eax clears the eax register (sets it to zero). This can be implemented with a machine independent simplification rule, such as a xor a = 0.
In general, it is best to delay detection of idiomatic sequences if possible, to later stages that are less affected by instruction ordering. For example, the instruction scheduling phase of a compiler may insert other instructions into an idiomatic sequence, or change the ordering of instructions in the sequence. A pattern matching process in the disassembly phase would probably not recognize the altered pattern. Later phases group instruction expressions into more complex expressions, and modify them into a canonical (standardized) form, making it more likely that even the altered idiom will match a higher level pattern later in the decompilation.
It is particularly important to recognize the compiler idioms for subroutine calls, exception handling, and switch statements. Some languages also have extensive support for strings or long integers.
Program analysis
Various program analyses can be applied to the IR. In particular, expression propagation combines the semantics of several instructions into more complex expressions. For example,
mov eax,[ebx+0x04]
add eax,[ebx+0x08]
sub [ebx+0x0C],eax
could result in the following IR after expression propagation:
m[ebx+12] := m[ebx+12] - (m[ebx+4] + m[ebx+8]);
The resulting expression is more like high level language, and has also eliminated the use of the machine register eax . Later analyses may eliminate the ebx register.
Data flow analysis
The places where register contents are defined and used must be traced using data flow analysis. The same analysis can be applied to locations that are used for temporaries and local data. A different name can then be formed for each such connected set of value definitions and uses. It is possible that the same local variable location was used for more than one variable in different parts of the original program. Even worse it is possible for the data flow analysis to identify a path whereby a value may flow between two such uses even though it would never actually happen or matter in reality. This may in bad cases lead to needing to define a location as a union of types. The decompiler may allow the user to explicitly break such unnatural dependencies which will lead to clearer code. This of course means a variable is potentially used without being initialized and so indicates a problem in the original program.
Type analysis
A good machine code decompiler will perform type analysis. Here, the way registers or memory locations are used result in constraints on the possible type of the location. For example, an and instruction implies that the operand is an integer; programs do not use such an operation on floating point values (except in special library code) or on pointers. An add instruction results in three constraints, since the operands may be both integer, or one integer and one pointer (with integer and pointer results respectively; the third constraint comes from the ordering of the two operands when the types are different).
Various high level expressions can be recognized which trigger recognition of structures or arrays. However, it is difficult to distinguish many of the possibilities, because of the freedom that machine code or even some high level languages such as C allow with casts and pointer arithmetic.
The example from the previous section could result in the following high level code:
struct T1 *ebx;
struct T1 {
int v0004;
int v0008;
int v000C;
};
ebx->v000C -= ebx->v0004 + ebx->v0008;
Structuring
The penultimate decompilation phase involves structuring of the IR into higher level constructs such as while loops and if/then/else conditional statements. For example, the machine code
xor eax, eax
l0002:
or ebx, ebx
jge l0003
add eax,[ebx]
mov ebx,[ebx+0x4]
jmp l0002
l0003:
mov [0x10040000],eax
could be translated into:
eax = 0;
while (ebx < 0) {
eax += ebx->v0000;
ebx = ebx->v0004;
}
v10040000 = eax;
Unstructured code is more difficult to translate into structured code than already structured code. Solutions include replicating some code, or adding boolean variables.
Code generation
The final phase is the generation of the high level code in the back end of the decompiler. Just as a compiler may have several back ends for generating machine code for different architectures, a decompiler may have several back ends for generating high level code in different high level languages.
Just before code generation, it may be desirable to allow an interactive editing of the IR, perhaps using some form of graphical user interface. This would allow the user to enter comments, and non-generic variable and function names. However, these are almost as easily entered in a post decompilation edit. The user may want to change structural aspects, such as converting a while loop to a for loop. These are less readily modified with a simple text editor, although source code refactoring tools may assist with this process. The user may need to enter information that failed to be identified during the type analysis phase, e.g. modifying a memory expression to an array or structure expression. Finally, incorrect IR may need to be corrected, or changes made to cause the output code to be more readable. |