Monthly Archives: April 2009

A rough implementation of the instruction selection done

The topic says it all.  Basically I take the intermediate representation (IR) and translate it back into x86 assembly.  The IR I use is tree based.  I translate these trees, which represent expressions, into x86 assembly, taking care to allocate registers and avoid too many redundant copies.

The instruction selection works by a process of greedy tree matching.  The top of the tree is matched against a list of patterns, with the biggest patterns matched first.  The top of the tree containing the pattern is then chopped off, an x86 instruction is emitted, and the subtrees are recursively matched.  In actual fact, the subtrees are generally recursed into before an instruction is emitted.  Registers can be allocated either at the leaves of the tree, where an expression is calculated with registers or immediates as operands, or register allocation can occur closer to the top of the tree.  Generating more optimal code here is still a work in progress.
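
To make the matching process concrete, here is a rough Python sketch of the greedy (maximal munch) tiling idea.  The Node shape, the tiny pattern table and the emitted text are invented for illustration; they are not my actual patterns.

import itertools

# Sketch of greedy ("maximal munch") tree tiling. A pattern is a nested
# tuple mirroring the tree shape; '_' matches any subtree, which is then
# tiled recursively and becomes an operand. Patterns are ordered biggest
# first, so the largest match at the root of the tree wins.

class Node:
    def __init__(self, op, *kids):
        self.op, self.kids = op, list(kids)

PATTERNS = [
    (('Add32', ('LoadMem32', '_'), '_'), ['mov {d}, {1}', 'add {d}, dword [{0}]']),
    (('Add32', '_', '_'),                ['mov {d}, {1}', 'add {d}, {0}']),
    (('LoadMem32', '_'),                 ['mov {d}, dword [{0}]']),
]

_fresh = itertools.count()

def matches(pat, node):
    if pat == '_':
        return True
    return (node.op == pat[0] and len(pat) - 1 == len(node.kids)
            and all(matches(p, k) for p, k in zip(pat[1:], node.kids)))

def wildcard_subtrees(pat, node):
    if pat == '_':
        return [node]
    out = []
    for p, k in zip(pat[1:], node.kids):
        out += wildcard_subtrees(p, k)
    return out

def tile(node, emit):
    if not node.kids:                       # leaf: a register or immediate
        return node.op
    for pat, templates in PATTERNS:
        if matches(pat, node):
            # recurse into the uncovered subtrees first, then emit
            operands = [tile(k, emit) for k in wildcard_subtrees(pat, node)]
            dest = 'v%d' % next(_fresh)     # unlimited virtual registers
            for t in templates:
                emit(t.format(*operands, d=dest))
            return dest
    raise ValueError('no pattern covers ' + node.op)

# tile(Node('Add32', Node('LoadMem32', Node('eax')), Node('ebx')), print)
# -> mov v0, ebx
#    add v0, dword [eax]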

However, I still assume an unlimited number of registers, so register allocation is next to implement.  I intend to use the linear scan algorithm, which is supposedly a fast register allocation algorithm, especially useful for dynamic binary translators and JIT compilers.  Other algorithms for register allocation use a graph colouring approach, but these can reportedly be quite slow.  After register allocation comes the assembling of x86 instructions into machine code.
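
For reference, here is a compact sketch of the classic linear scan algorithm (Poletto and Sarkar) over precomputed live intervals.  The interval representation and the spill handling are simplifying assumptions, not my allocator.

def linear_scan(intervals, num_regs):
    """intervals: list of (vreg, start, end); returns (assignment, spilled)."""
    intervals = sorted(intervals, key=lambda iv: iv[1])   # by start point
    active = []                       # live intervals currently in a register
    free = list(range(num_regs))      # physical register pool
    assignment, spilled = {}, set()

    for vreg, start, end in intervals:
        # expire intervals that ended before this one starts
        for old in [a for a in active if a[2] < start]:
            active.remove(old)
            free.append(assignment[old[0]])
        if free:
            assignment[vreg] = free.pop()
            active.append((vreg, start, end))
        else:
            # no register free: spill whichever live interval ends last
            active.sort(key=lambda iv: iv[2])
            victim = active[-1]
            if victim[2] > end:
                assignment[vreg] = assignment.pop(victim[0])
                spilled.add(victim[0])
                active[-1] = (vreg, start, end)
            else:
                spilled.add(vreg)
    return assignment, spilled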

There are still lots of problems to fix.  My code still doesn't work very nicely with non-sequential code (i.e., code with more than one basic block).  Also, I can still generate IR sequences that result in semantic NOPs.

Incidentally, I have enough of the x86 -> IR code done that it can now process all the instructions executed by the UPX packer.  But I still haven't finished the status flags handling, so it's not quite complete.

The woes of Get and Put

When writing an interpreter or a dynamic binary translator, it's typical to maintain a CPU context in memory.  This is some memory set aside to store the CPU state, such as register contents, the program counter and so forth.  In an interpreter, since you are only handling a single instruction at a time, you make all changes to the guest CPU registers directly in that CPU context stored in memory.  In dynamic binary translation, there is an opportunity to use real registers in the translated blocks.  But typically, at the beginning and end of a block, you transfer register values between the stored CPU context and those real registers.

But how does that transfer of registers to the CPU context actually work?

Valgrind uses a nice idea of having specific IR instructions set aside to do this transfer: Get and Put.  You specify an offset into the CPU context, the size, and an IR register, and the CPU context contents are transferred.
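
To illustrate what Get and Put boil down to, here is a small sketch with the guest context as a flat byte array.  The offsets and the GuestState interface are made up; this is my reading of the idea, not VEX's actual data structures.

import struct

CTX_SIZE = 64
EAX_OFFSET, EBX_OFFSET = 0, 4            # hypothetical context layout

class GuestState:
    def __init__(self):
        self.ctx = bytearray(CTX_SIZE)   # the in-memory guest CPU context
        self.tmp = {}                    # IR registers / temporaries

    def get(self, offset, size, treg):
        """Get: load `size` bytes at `offset` of the context into an IR register."""
        fmt = {1: '<B', 2: '<H', 4: '<I'}[size]
        self.tmp[treg] = struct.unpack_from(fmt, self.ctx, offset)[0]

    def put(self, offset, size, treg):
        """Put: store an IR register back into the context."""
        fmt = {1: '<B', 2: '<H', 4: '<I'}[size]
        struct.pack_into(fmt, self.ctx, offset, self.tmp[treg])

# At fragment entry: state.get(EAX_OFFSET, 4, 't0')
# At each exit:      state.put(EAX_OFFSET, 4, 't0')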

A question is, should Get/Put be manually generated in the native code to IR translation?  But first, what are all the places Get/Put should be used?

At the beginning of a code fragment, registers need to be loaded.  So this is the main place to use Get.  At the end of the code fragment, or rather, at every exit point in the fragment, modified registers need to be Put back.

That is quite reasonable, but more annoying are places where control can be indirectly transferred.  Take memory loads and stores.  These operations, handled by callbacks in the translated code when using a software MMU implementation, can cause exceptions in the guest.  This means that a memory load or store is a point of exit for the code fragment.  Thus, it seems we need to Put the registers back beforehand.

Putting back registers before every memory access is really annoying and seems likely to cause a bit of bloat in the translated fragments.  I don't have a nicer solution however.  Valgrind doesn't seem to have this problem, and I'm not quite sure how it avoids it.  More investigation is required.

So back to automatically generating Get and Put statements in the IR.  This can be done.  A simple but slightly incorrect solution is to Get, at the entry point, all registers that are used by the rest of the code fragment.  Another simple solution is to Put all registers that are defined (written to) by the code fragment at each exit point.

A problem exists that you might Put a register before it's been defined, depending on the order of instructions.  I imagine the correct solution to this is to only Put registers that are in the set of reaching definitions.

I didn't implement the reaching definitions solution.  I used the more conservative approach of Getting the set of all registers that are used as well as those that are defined.  That is faster to generate, but produces worse IR.
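
A sketch of that conservative insertion, assuming each IR instruction can report the guest registers it reads and writes and whether it is an exit point (that interface is invented for the example):

def insert_get_put(fragment):
    """fragment: list of IR instructions with guest_uses(), guest_defs(), is_exit()."""
    used, defined = set(), set()
    for insn in fragment:
        used |= set(insn.guest_uses())
        defined |= set(insn.guest_defs())

    out = []
    # Get everything the fragment touches (used as well as defined), so a
    # later Put never reads an IR register that was never loaded.
    for reg in sorted(used | defined):
        out.append(('Get', reg))
    for insn in fragment:
        if insn.is_exit():
            # Put back everything that may have been modified
            for reg in sorted(defined):
                out.append(('Put', reg))
        out.append(insn)
    return out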

Incidentally, another problem with memory accesses is the saving of the lazy eflags registers.  These also need to be saved before a potential exception, else the SEH handler might see the wrong value in the SEH context.

In summary, a lot of extra headaches are caused by potential exceptions in memory accesses.  Maybe there is a better solution, but it might require a more hackish approach than the one I've described.

NotLive IR instruction

Maybe this is too small a thing to blog about. I added a NotLive instruction to my IR. It's essentially an annotation to mark a register as not being live, meaning it will not be read again before being overwritten. This allows some optimisations to occur in my IR -> tree IR conversion without the expense of liveness analysis. The NotLive instruction is entirely optional. There is no problem if it's left out; it simply means further optimisations will not be possible. The use of the instruction can lead to an inconsistency in the code if the register marked as dead (not live) is then read from. Good use of the instruction should avoid this.
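
As a rough sketch of how the annotation gets used (my converter's real interface differs), the writeback pass at the end of a block can simply skip any register marked NotLive:

def finalize_block(linear_ir, pending):
    """pending maps a register name to the tree expression still to be written back."""
    not_live = {insn[1] for insn in linear_ir if insn[0] == 'NotLive'}
    writebacks = []
    for reg, expr in pending.items():
        if reg in not_live:
            continue                      # marked dead: drop the redundant Load
        writebacks.append(('Load', expr, reg))
    return writebacks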

An example from yesterday without NotLive -->

t2 = AddImm32(%40, $0x0)
t1 = LoadMem8(%t2)
t0 = And8(%t1, $0x7f)
StoreMem8(And8(LoadMem8(Add32(%8, %eax)), $0x7f), Add32(%8, %eax))
StoreMem8(%t0, %t2)
Load8(%t0, %temp_op1b)                #can be removed
Load8($0x7f, %temp_op2b)              #can be removed
Load32(%t2, %temp_memreg)           #can be removed
Load32(Widen8_To_SXLo8In32(%t0), %lazyEFlagsResult)
Load32(Widen8_To_SXLo8In32(%t1), %lazyEFlagsOp1)
Load32($0x7f, %lazyEFlagsOp2)

You can see that those Loads to temp_op1b, temp_op2b and temp_memreg are actually not necessary. They are just artifacts of the IR generation; they are intermediaries that are no longer needed once the basic block has been completed.

# ... [skip ]
# Widen8_To_SXLo8In32 (%temp_op1b, , %lazyEFlagsOp1)
# LoadImm8 ($0x7f, , %temp_op2b)
# LoadImm32 ($0x7f, , %lazyEFlagsOp2)
# And8 (%temp_op1b, %temp_op2b, %temp_op1b)
# Widen8_To_SXLo8In32 (%temp_op1b, , %lazyEFlagsResult)
# StoreMem8 (%temp_op1b, , %temp_memreg)
# NotLive (%temp_memreg, , )
# NotLive (%temp_op1b, , )
# NotLive (%temp_op2b, , )

Note the use of NotLive to mark those intermediaries as no longer live, which results in the following tree IR.

t2 = AddImm32(%40, $0x0)
t1 = LoadMem8(%t2)
t0 = And8(%t1, $0x7f)
StoreMem8(And8(LoadMem8(Add32(%8, %eax)), $0x7f), Add32(%8, %eax))
StoreMem8(%t0, %t2)
Load32(Widen8_To_SXLo8In32(%t0), %lazyEFlagsResult)
Load32(Widen8_To_SXLo8In32(%t1), %lazyEFlagsOp1)
Load32($0x7f, %lazyEFlagsOp2)

That produced quite nice code. I think it is near optimal. Of course you're wondering why there is an AddImm32(%40, $0x0), which is equivalent to a NOP. This is due to the lack of an optimisation pass over the linear IR once it's been generated. I'll save that for another day.

Since I've been blogging about how great this tree IR is, I _really_ hope there aren't too many major bugs in it which cause it to be killed off. I have really done little debugging in terms of verifying the output, except for a casual glance to see that it is approximately correct.

First implementation of IR -> Tree IR

The pastes below aren't functional, nor are they final or even 100% correct.  But they are kinda cool.

I haven’t tested them fully so they might be really wrong.

# Label (*0x0, , )
# Load32 (%8, , %temp_memreg)
# Add32 (%temp_memreg, %eax, %temp_memreg)
# LoadMem8 (%temp_memreg, , %temp_op1b)
# Widen8_To_SXLo8In32 (%temp_op1b, , %lazyEFlagsOp1)
# LoadImm8 ($0x7f, , %temp_op2b)
# LoadImm32 ($0x7f, , %lazyEFlagsOp2)
# And8 (%temp_op1b, %temp_op2b, %temp_op1b)
# Widen8_To_SXLo8In32 (%temp_op1b, , %lazyEFlagsResult)
# StoreMem8 (%temp_op1b, , %temp_memreg)
# Load32 (%8, , %temp_memreg)
# Add32 (%temp_memreg, %eax, %temp_memreg)
# AddImm32 (%40, $0x0, %temp_memreg)
# LoadMem8 (%temp_memreg, , %temp_op1b)
# Widen8_To_SXLo8In32 (%temp_op1b, , %lazyEFlagsOp1)
# LoadImm8 ($0x7f, , %temp_op2b)
# LoadImm32 ($0x7f, , %lazyEFlagsOp2)
# And8 (%temp_op1b, %temp_op2b, %temp_op1b)
# Widen8_To_SXLo8In32 (%temp_op1b, , %lazyEFlagsResult)
# StoreMem8 (%temp_op1b, , %temp_memreg)

now the TreeIR version -->

t2 = AddImm32(%40, $0x0)
t1 = LoadMem8(%t2)
t0 = And8(%t1, $0x7f)
StoreMem8(And8(LoadMem8(Add32(%8, %eax)), $0x7f), Add32(%8, %eax))
StoreMem8(%t0, %t2)
Load8(%t0, %temp_op1b)
Load8($0x7f, %temp_op2b)
Load32(%t2, %temp_memreg)
Load32(Widen8_To_SXLo8In32(%t0), %lazyEFlagsResult)
Load32(Widen8_To_SXLo8In32(%t1), %lazyEFlagsOp1)
Load32($0x7f, %lazyEFlagsOp2)

Another example..
# Label (*0x0, , )
# Load32 (%ebx, , %temp_op1d)
# Load32 (%ebx, , %lazyEFlagsOp1)
# LoadImm32 ($0x4, , %temp_op2d)
# LoadImm32 ($0x4, , %lazyEFlagsOp2)
# Add32 (%temp_op1d, %temp_op2d, %temp_op1d)
# Load32 (%temp_op1d, , %lazyEFlagsResult)
# Load32 (%temp_op1d, , %ebx)

and the TreeIR -->

t0 = Add32(%ebx, $0x4)
Load32(%t0, %ebx)
Load32(%t0, %temp_op1d)
Load32($0x4, %temp_op2d)
Load32(%t0, %lazyEFlagsResult)
Load32(%ebx, %lazyEFlagsOp1)
Load32($0x4, %lazyEFlagsOp2)


The Load32/Load16/Load8 at the end of each basic block is important in the tree IR.  It writes back to the real registers what they should contain on exit.  This is inconsistent with how the temporaries are assigned.  Also, the temporaries don't have sizes associated with them yet, but this isn't a major problem.

There is a slight unoptimisation in that things like temp_op1, temp_op2 etc. are also written back, when they are in fact no longer live.  This could be eliminated with liveness analysis, but I suspect solving the data flow equations is too expensive to be used in binary translation.  I am thinking of introducing a NotLive instruction in my IR to help me with this scenario.

This is almost useful for a decompiler 😉

Iterative solutions to dataflow problems

The iterative solution is a well known method for solving data flow equations.  The iterative solution processes nodes by applying the transfer function and join operator, and repeats until the state of each node reaches a fixed point – that is, there is no change of state in the node after the transfer+join.

There are different ways of implementing the iterative solution.  Round robin is the simplest: traverse or iterate through each node in a single pass, applying the data flow equations (transfer function and join).  If any node had a change in state, the process is repeated.  My first implementation of this ignored the ordering of the nodes and the control flow graph completely.

Today I thought I would make a faster version using the worklist approach.  The worklist approach starts off with all nodes in a worklist.  Then a node is removed from the worklist and processed.  If the state changes, then its successors (for forward dataflow problems) are added to the worklist.  Another item is taken from the worklist if one exists and the process repeats until the list is empty.  I implemented that earlier today.
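
For concreteness, here is a generic forward worklist solver in Python.  The cfg/transfer/join interfaces are placeholders; note the membership set that stops duplicates being re-queued, which turned out to matter, as described below.

from collections import deque

def solve_forward(cfg, transfer, join, init):
    """Generic forward dataflow. cfg maps node -> (preds, succs); join([])
    must return the boundary value (e.g. the empty set for reaching defs)."""
    in_state = {n: init() for n in cfg}
    out_state = {n: init() for n in cfg}

    worklist = deque(cfg)            # start with every node queued
    queued = set(cfg)                # membership set: no duplicates on the list
    while worklist:
        node = worklist.popleft()
        queued.discard(node)
        preds, succs = cfg[node]
        in_state[node] = join([out_state[p] for p in preds])
        new_out = transfer(node, in_state[node])
        if new_out != out_state[node]:
            out_state[node] = new_out
            for s in succs:          # state changed: revisit the successors
                if s not in queued:
                    worklist.append(s)
                    queued.add(s)
    return in_state, out_state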

It turns out that my worklist approach wasn't at all well implemented.  It was requiring the processing of 9000 nodes at one point; it turned out I wasn't ignoring duplicates in the worklist.  Fixed.  Now it took 1900 node visits to reach a fixed point.

BUT.. the round robin approach implemented by traversing the CFG is much faster.  It only took 580-odd node visits.  The order of the traversal is supposed to be quite important.  For forward data flow problems, reverse post order is used, where a node is processed before its successors.  There is a modification to this indicated on Wikipedia which talks about exceptions involving back edges.  I haven't implemented that.
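
The round robin variant just sweeps the nodes in reverse postorder until nothing changes.  A sketch, using the same placeholder interfaces as the worklist version above:

def reverse_postorder(cfg, entry):
    """DFS postorder reversed, so a node tends to be visited before its successors."""
    order, seen = [], set()
    def dfs(n):
        seen.add(n)
        for s in cfg[n][1]:              # successors
            if s not in seen:
                dfs(s)
        order.append(n)
    dfs(entry)
    return list(reversed(order))

def solve_round_robin(cfg, entry, transfer, join, init):
    in_state = {n: init() for n in cfg}
    out_state = {n: init() for n in cfg}
    rpo = reverse_postorder(cfg, entry)
    changed = True
    while changed:                        # repeat whole sweeps until a fixed point
        changed = False
        for node in rpo:
            in_state[node] = join([out_state[p] for p in cfg[node][0]])
            new_out = transfer(node, in_state[node])
            if new_out != out_state[node]:
                out_state[node] = new_out
                changed = True
    return in_state, out_state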

It's just a note to say that sometimes when you think you've got the faster algorithm, it's not certain until you test it.  I'm sure it's possible to get the worklist approach to be faster than round robin, and in fact a paper I'm reading talks about the naive worklist algorithm being slower than the round robin method.

I think that even if I implement a better algorithm, data flow analysis will still be too slow to be used in a dynamic binary translator.

Early implementation of a dataflow analysis framework

That tree based IR was causing me grief, and at one point I decided that the solution was to convert my IR into SSA form. And then I started thinking I needed a data flow analysis framework. After a day or two of coding I've got some initial results on the data flow analysis, but still haven't been able to generate a tree based IR.

The data flow analysis framework has only reaching definitions implemented so far, but it should be fairly easy now to add new types of analysis such as live variables or available expressions. I have three main handlers: Join, TransferFunction, and a final function to move the results from the in state to the out state and check if the state has changed. That final function is useful for solving the dataflow equations using an iterative solution.
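
As an example of plugging an analysis into that shape, reaching definitions comes out as a union Join and a gen/kill TransferFunction. The block.gen/block.kill accessors, and the exact form of the final "commit" step, are assumptions for the sketch rather than my actual interfaces:

def join(pred_out_states):
    """Join for reaching definitions: the union of the predecessors' out sets."""
    result = set()
    for s in pred_out_states:
        result |= s
    return result

def transfer(block, in_state):
    """TransferFunction: out = gen | (in - kill)."""
    # block.gen:  definitions created in this block
    # block.kill: definitions of the same registers made elsewhere
    return block.gen | (in_state - block.kill)

def commit(block, new_out, out_state):
    """The final step: record the new out state and report whether it changed."""
    changed = new_out != out_state.get(block)
    out_state[block] = new_out
    return changed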

Along the way I had to introduce some other bits and pieces. My IR works in a kind of stack. Each IR instruction gets handed down the layers in the stack. The first layer is handed IR instructions that are generated from the translation of native code. The next layer could be used to provide a storage mechanism for each instruction.

I added another layer today to convert blocks with multiple entries and exits into single entry, single exit blocks. It's much easier to write code when you don't have to use explicit labels on every basic block, but for automated analysis, those demarcations are essential. I also wrote code to support control flow graphs; this is processed by another layer on the IR stack. It takes in a sequence of IR instructions and, provided each basic block is labelled (which is now generated by the single entry, single exit code), it builds a control flow graph.
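
A sketch of that CFG-building layer, assuming the stream has already been cut into labelled basic blocks and that branches look like ('Jmp', target) or ('JmpCond', target); the instruction shapes are invented for the example, and the result uses the same label -> (preds, succs) mapping as the solver sketches above.

def build_cfg(blocks):
    """blocks: list of (label, instructions); returns label -> (preds, succs)."""
    succs = {label: [] for label, _ in blocks}
    for i, (label, insns) in enumerate(blocks):
        last = insns[-1] if insns else None
        if last and last[0] == 'Jmp':
            succs[label].append(last[1])              # unconditional: no fall-through
        else:
            if last and last[0] == 'JmpCond':
                succs[label].append(last[1])          # taken edge
            if i + 1 < len(blocks):
                succs[label].append(blocks[i + 1][0]) # fall-through edge
    preds = {label: [] for label, _ in blocks}
    for label, targets in succs.items():
        for t in targets:
            if t in preds:
                preds[t].append(label)
    return {label: (preds[label], succs[label]) for label, _ in blocks}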

I'm worried that all this analysis is not going to be viable for use in dynamic binary translation, as it's really quite computationally expensive, but at least I have the code for other things as well.

The start of a tree based IR, and some problems with basic blocks

I've been writing code to convert three address code, a linear IR, to a structured IR, a tree based IR.  I'm doing this in the hope of better instruction selection using the structured IR.  The tree IR is essentially an expression tree, but it also has to deal with statements (which, for me, are instructions that don't have a result).

For the most part, it hasn't seemed like a large task to construct an expression tree from the instructions.  There are some interesting things that cropped up, however.

Common sub-expressions are evident in the tree as nodes with more than one parent.  To output the expression below as plain text requires some thought, e.g.

Put32(Add32(Add32(ebx,ecx),Add32(ebx,ecx)),0)

[note: Put32 and Get32 are Valgrind/VEX style instructions that transfer a value to/from a stored context (in Valgrind, this stores CPU state like registers etc)].

If there is a common sub-expression, then this should be placed in a temporary, and the temporary used instead of that particular sub-expression in the code. In the code I showed above, it should be

t0 = Add32(ebx,ecx)
Put32(Add32(t0, t0), 0)
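
One cheap way to spot those shared sub-expressions without a full CSE pass is to count how many parents each node has while walking the tree, and pull anything with more than one parent into a temporary.  A sketch, with the node layout assumed:

from collections import Counter

class Node:
    def __init__(self, op, *kids):
        self.op, self.kids = op, list(kids)

def count_parents(node, counts=None):
    counts = Counter() if counts is None else counts
    for kid in node.kids:
        counts[id(kid)] += 1
        if counts[id(kid)] == 1:          # only descend a child the first time
            count_parents(kid, counts)
    return counts

def emit(root):
    counts = count_parents(root)
    names, lines = {}, []

    def render(node):
        if id(node) in names:             # already hoisted into a temporary
            return names[id(node)]
        if not node.kids:
            return node.op
        text = '%s(%s)' % (node.op, ','.join(render(k) for k in node.kids))
        if counts[id(node)] > 1:          # shared node: give it a temporary
            names[id(node)] = 't%d' % len(names)
            lines.append('%s = %s' % (names[id(node)], text))
            return names[id(node)]
        return text

    lines.append(render(root))
    return lines

# a = Node('Add32', Node('ebx'), Node('ecx'))
# emit(Node('Put32', Node('Add32', a, a), Node('0')))
# -> ['t0 = Add32(ebx,ecx)', 'Put32(Add32(t0,t0),0)']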


This brought up the problem I'm having with conditional branches. Generating the tree IR only seems to really work in the context of a basic block, and it's quite possible for my three address code IR to generate multiple basic blocks from a single native instruction. For instance, I explicitly generate an iterative loop when representing a rep movs etc.

To co-ordinate the interaction between basic blocks, I'm having to explicitly assign the value of expressions to their originally assigned register. Remember that by converting the original IR to an expression in the tree IR, I'm essentially eliminating those registers, e.g.

Put32(Add32([eax=]Add32(eax, ebx), ebx), 0)


The [eax=] isn't really required.

Anyway, I have to add that explicit assignment in between basic blocks, and I do this immediately before a conditional branch is made. But which registers and expressions should be explicitly assigned? It's at most the set of registers that were written to in the original code used to build the expressions. To reduce this set even further, I imagine that a liveness analysis could give the minimal set of registers. The liveness analysis would tell us which registers will be used in the future. Clearly, if a register isn't used, we do not need to remember it.

But this all seems a bit over the top. I don't really want to do a liveness analysis just to convert my linear IR to a tree based IR. This translation has to be fast if it's going to be effective in a dynamic binary translator. Anyway, I doubt I'm going to implement the liveness analysis at this point, but if I'm feeling particularly motivated one day after the rest of the code works, who knows..

I'll try to finish the tree IR in the next couple of days, and have a look at the output it generates before making up my mind on whether the new IR is worth the effort.