Silviocesare’s Weblog

Iterative solutions to dataflow problems

April 15, 2009 · Leave a Comment

The iterative solution is a well known method for solving data flow equations.  The iterative solution processes nodes by applying the transfer function and join operator, and repeats until the state of each node reaches a fixed point – that is, there is no change of state in the node after the transfer+join.

There are a different ways of implement the iterative solution.  The round robin is the simplest, and that is done by traversing or iterating through each node in a single pass applying the data flow equations (transfer function and join).  If any node has a change in state, the process is repeated.  My first implementation of this ignored the ordering of the nodes and control flow graph completely.

Today I thought I would make a faster version using the worklist approach.  The worklist approach starts off with all nodes in a worklist.  Then a node is removed from the worklist and processed.  If the state changes, then its successors (for forward dataflow problems) are added to the worklist.  Another item is taken from the worklist if one exists and the process repeats until the list is empty.  I implemented that earlier today.

It turns out that my worklist approach isn’t at all well implemented.  I was requiring processing of 9000 nodes at one point.  Turned out I wasn’t ignoring duplicates in the worklist.  Fixed.  Now it took 1900 nodes to reach a fixed point.

BUT.. the round robin approach implemented by traversing the CFG is much faster.  It only took 580 odd nodes to process.  The order of the traversal is supposed to be quote important.  For forward data flow problems, reverse post order is used, where the node is processed before its successors.  There is a modification to this indicated on wikipedia which talks about exceptions involving backedges.  I haven’t implemented that.

It’s just a note to say that sometimes when you think you’ve got the faster algorithm, its not certain until you test it.  I’m sure its possible to get the worklist approach to be faster than the round robin, and infact a paper I’m reading talks about the naive worklist algorithm being slower than the round robin method.

I think that it seems even if I implement a better algorithm, data flow analysis is still too slow to be used in a dynamic binary translator.

→ Leave a CommentCategories: Uncategorized

Early implementation of a dataflow analysis framework

April 14, 2009 · Leave a Comment

That tree based IR was causing me grief, and at one point I decided that the solution was to convert my IR into SSA form. And then I started thinking I needed a data flow analysis framework. After a day or two or coding I’ve got some initial results on the data flow analysis but still haven’t been able to generate a tree based IR.

The data flow analysis framework has only reaching definitions implemented so far, but it should be fairly easy now to add new types of analysis such as live variables or available expressions. I have 3 main handlers, Join, TransferFunction, and a final function to move the results from the in state to the out state and check if the state has changed. That final function is useful for solving the dataflow equations using an iterative solution.

Along the way I had to introduce some other bits and pieces. My IR works in a kind of stack. Each IR instructions gets handed down layers in the stack. The first layer is handed IR instructions that are generated from the translation of native code. The next layer could be used to provide a storage mechanism of each instruction.

I added another layer today in my code to convert single exit multiple exit blocks into single entry, single exit blocks. It’s much easier to write code when you don’t have to use explicit labels on every basic block, but for automated analysis, those demarcations are essential. I also wrote code to support the control flow graphs – this is processed by another layer on the IR stack. It takes in a sequence of IR instructions, and providing each basic block is labelled (which now is generated by the single entry single exit code), it builds a control flow graph.

I’m worried that all this analysis is not going to be viable for use in the dynamic binary translation as it’s really quite computationally expensive, but at least I have the code for other things also.

→ Leave a CommentCategories: Uncategorized

The start of a tree based IR, and some problems with basic blocks

April 12, 2009 · 2 Comments

I’ve been writing code to convert three address code, a linear IR , to a structured IR, a tree based IR.  I’m doing this with the hopes of better instruction selection using the structured IR.  The tree IR is essentially an expression tree, but it also has to deal with statements (which for me, are instructions which don’t have a result).

It hasn’t seemed like a large task to construct an expression tree from the instructions for the most part.  There are some interesting things that cropped up however.

Common sub-expressions are evident in the tree as nodes with more than one parent.  To output the expression below as plain text requires some thought, eg

Put32(Add32(Add32(ebx,ecx),Add32(ebx,ecx)),0)

[note: Put32 and Get32 are Valgrind/VEX style instructions that transfer a value to/from a stored context (in Valgrind, this stores CPU state like registers etc)].

If there is a common sub-expression, then this should be placed in a temporary, and the temporary used instead of that particular sub-expression in the code. In the code I showed above, it should be

t0 = Add32(ebx,ecx) Put32(Add32(t0, t0), 0)  


This brought up the problem I’m having with conditional branches. Generating the tree IR only seems to really work in the context of a basic block. And it’s quite possible for my three
address code IR to generate multiple basic blocks from a single native instructions. For instance, I explictly generate an iterative loop when representing a rep movs etc.

To co-ordinate the interaction between basic blocks, I’m having to explicitly assign the value of expressions to their originally assigned register. Remember that by the original IR to an expression in the tree IR, I’m essentially eliminating those registers, eg

Put32(Add32([eax=]Add32(eax, ebx), ebx), 0)


The [eax=] isnt really required.

Anyway, I have to add that explicit assignment in between basic blocks and I do this immediately before a conditional branch is made. But which registers and expressions should be explicitly assigned? Its at most the set of registers that were written to in the original code to build the expressions. To reduce this set even further, I imagine that a liveness analysis could give the set of minimal registers. The liveness analysis would tell us which registers would be used in the future. Clearly, if the register isn’t used we do not need to remember it.

But this all seems a bit over the top. I don’t really want to do a liveness analysis just to convert my linear IR to a tree based IR. This translation has to be fast if its going to be effective in a dynamic binary translator. Anyway, I’m doubtfully going to implement the liveness analysis at this point, but if I’m feeling particularly motivated one day after the rest of the code works, who knows..

I’ll try to finish the tree IR in the next couple days, and have a look at the output it generates before making up my mind on whether the new IR is worth the effort.

→ 2 CommentsCategories: Uncategorized

Instruction Selection

April 11, 2009 · Leave a Comment

I have implemented a reasonable number of instructions now doing x86 -> ir.  The differences between this compared to REIL, are the use of explicit instructions for each combination of operand type, eg AddImm8 and Add32.  The use of instructions which could be implemented using simpler operations, eg, Loading the 2nd byte of a dword into an 8 bit register (think ah/bh/ch/dh).  The use of many conditional branches (Je/Jne/Jz/Jnz/Jl/Jle/Jg/Jge).  The use of a Dispatch_Jz/Dispatch/Jnz conditional etc that uses a virtual address instead of a label.  Not marking each instruction with an address of its own, but using labels.  And marking labels using a Label instruction.  The use of a RaiseException instruction.  Handling of segmentation and checking segment limits – but still using linear addresses in ReadMem/StoreMem. 

Those are the major differences, and I certainly haven’t finished writing all the code yet, so it may change in the future.  But the instructions I’ve done so far are working out OK and I’m generating some nice output.   I’ve spent some time this morning thinking if I should have Put/Get instructions like Valgrind/VEX which copy to and from a guest register/context area.  Will think some more..

Which got me thinking about the next phase of my code, translating the IR back into x86. ( In the final implementation there will be an optimisation phase on the IR, but it’s not so important for now and I will focus on the code generation)

My IR is a linear IR implemented as three address code.  The naive approach to code generation is to convert is to translate each individual IR instruction to x86 seperately.  The problem with this, is that I suspect it will produce very bad code.

AddImm32 $1,,%eax
Add32 %eax,%ebx,%eax # eax = eax + ebx
LoadMem32 %eax,,%ecx # ecx = LoadMem32(eax)
// annotation: eax no longer live

Translating one instruction at a time to x86 will produce something like this –>

inc %eax
add %ebx,%eax
mov (%eax),%ecx

But really, what we want is something like this –>

mov 1(%eax,%ebx), %ecx

This seems to be hard to generate from three address code. A tree based IR seems much more able for this type of instruction selection. 

Valgrind uses a tree based IR internally (though it alternatives between a linear IR which it calls a flat or flattened IR.  QEMU uses a linear IR.

Maybe three address code will work out OK and I’m worrying about nothing, but I’m not entirely convinced.  Maybe for the first implementation I will just use three address code.

The tree IR seems interesting though for static analysis also in the sense that the representation inherently is able to support instruction re-ordering and register renaming (to an extent).  I’m really exaggerating its ability to handle this in the sense of being able to use it for identifying equivalence in some types of code blocks, but I suspect it can make the basis of a useful approach.  Think code normalization and malware detection of variants.  Not sure if this is plausible.  Something for the future..

→ Leave a CommentCategories: Uncategorized

Attending the University Academic Awards Presentation

April 8, 2009 · 1 Comment

Just got back from CQUniversitie’s academic awards night.  Had an open bar before and after the presentation and nibbly type food at the end.  I received the Queensland Alumina Ltd prize for highest GPA in Information Systems Project Management.  It was a pretty nice event.. Much better than the last time I received an award in 1994, when it was a fairly non fuss thing in a tiny room without very few people.  Tonight, this year for the first time they put all the awards presentations for each faculty together, and made it much more grander and official – hence the drinks and food – Even the Mayor attended.. Yes, I live in a tiny town..

I’m still working on the x86->ir code which I’ve been procrastinating about.  Also been doing some small attempts at refactoring my code.  40,000 lines (including public domain code I imported such as libdasm).. Splitting up giant source files, moving files into separate directories etc..  I’ll probably also write up this week a basic outline of the literature review I have to do for my degree.

→ 1 CommentCategories: Uncategorized

My research application is accepted, and I’ve accepted the letter of offer

March 31, 2009 · 4 Comments

The topic says it all.  I changed the start date of the research from 1st May to 31st March (today) at the Universities request.  Clearly they want to increase their admission numbers for this term.  The degree is expected to be completed in 1.5 years.  Maximum time is 2 years and minimum 1 year.

→ 4 CommentsCategories: Uncategorized

My Research Application

March 27, 2009 · 1 Comment

I submitted my research application to the University (http://www.cqu.edu.au) today.  It is for a Masters (in Informatics) degree.  The research topic I proposed is “Fast Automated Malware Unpacking and Classification”.  This will be based on the emulator and unpacker I’ve been working on and blogging about.  Things I will do for the research that I haven’t done yet, is to implement dynamic binary translation, and use the structural classification based on control flow used/documented by Zynamics/Halvar Flake etc which they say is not for the desktop, and make it usable for realtime desktop on-access applications.  I will also integratethe ClamAV signature database.  Its basically going to be a fast anti-malware system.  I’m not sure if I will make public the source code, but I have talked to my (to be) supervisor, and he is fine with it if I choose to do that.

→ 1 CommentCategories: Uncategorized

What appears to be a QEMU segment limit bug, and some DBT implementation thoughts

March 23, 2009 · 3 Comments

[Edit:  I received an Email from Otto Ebeling who told me that this is a known and documented limitation of QEMU.  But practically no OS would require these segment limit or access checks in any case.]

First up, an unverified QEMU bug handling segmented memory.  From looking quickly at the code QEMU has special handling for checking the segment limit in places say like, executing code in the code segment.  In other places such as what happens during a movs instruction, it calculates a linear address by adding the segment base to the address/offset held in esi or edi.  It doesn’t check the segment limit however (or even check for an integer overflow in adding the segment base for that matter).  This to me seems like a bug.  If you set up 2 segments, it could be possible to reference the 2nd segment from the 1st by overflowing the 1st segments limit.

Fixing this bug seems to be quite tricky.  The correct behaviour I think is to generate an access violation/exception when trying to dereference a segment:offset.  But QEMU generates translated code that calculates the linear address and holds it in its cache.  After calculating the address, at some point the software MMU is called using the linear address, and this is when the exception should be raised.

I had this same bug in my interpreted emulator although I knew of the problem when I first wrote the code, but then forgot about it.  The fix was pretty straight forward.  To fix the problem in the dynamic binary translation code I’m working on seems a little harder.  The ReadMemory function I use in the emulator actually takes a segment selector as a parameter, so I could simply pass the segment selector instead of trying to calculate a linear address from the segment base.  This is probably the easiest fix, but it means that the IR i’m working on becomes really ugly.  I don’t think its wise to use an IR that has segmentation, and it would be annoying to use such an IR in other projects..  I’m undecided how I’ll try to fix the segment limit problem for now..

I also spent a fair amount of time deciding about the problem of register aliasing.  In x86, al, ax, and eax all reference the same register.  They are register aliases.  I wasn’t sure if my IR should handle everything as 32bit, and not use different size operands.  I decided against this because I would lose some kinds of information.  The type reconstruction in a decompiler would seem to need the original operand sizes, and I wanted my IR to be potentially useful if I tried implementing this in the future.  I have also decided on using IR instructions that extract the lo/hi bytes akin to al/ah etc from a 32bit word, and also extracting the lower 16bits of a 32bit word.  This will make for better code generation for the DBT to have such instructions.  Valgrind actually has similar functions, except to get say ah from eax, it extracts the lower 16 bits first in 1 instruction, then the high byte from that 16bits in another instruction.

I have also currently decided to use registers in my IR instead of a block of memory representing the guest state like Valgrind uses.  Maybe the Valgrind approach is better, but for now I’m using registers.

REIL (the IR used in binnavi/zynamics etc) doesnt seem to have instructions that do the type of casts that I’m talking about above.  I only have the public online REIL documentation, so clearly they must have equivalent but for now I can’t see it documented.

→ 3 CommentsCategories: Uncategorized

Trying to figure out register allocation in dynamic binary translation

March 19, 2009 · 2 Comments

I’ve been trying to figure out what I need to do to implement dynamic binary translation in my emulator.

The most common theme when translating a basic block, is to disassemble each instruction.  From this generate an intermediate representation.  After the basic block is converted to this IR, perform any reasonable optimisations on the new code.  Then when done, generate native code from the IR.

The IR might use a small number of registers, and these registers can be statically mapped to a host.  QEMU follows this approach.  Guest registers are kept in memory.  This is probably the simplest approach to register allocation, but I suspect the native code QEMU generates can be greatly improved by keeping guest registers in real registers on the host and spilling when necessary.

So QEMU when faced with the x86 instruction, inc %ebx; inc %ebx, it generates IR similar to the following.


mov &context->ebx,%TEMP1
inc %TEMP1
mov %TEMP1, &context->ebx
mov &context->ebx,%TEMP1
inc %TEMP1
mov %TEMP1, &context->ebx

 
That’s not so bad but when you consider that most instructions in the basic block access the same register, it would be really nice if the IR generated code like this –>


mov &context->ebx,%ebx;
inc %ebx
inc %ebx
mov %ebx, &context->ebx

Clearly, there are advantages to using the QEMU approach, as register allocation problems are practically eliminated and at most only those temporaries defined by the IR are used in each host.

I haven’t looked in great depth at Valgrind, but it seems to use the more optimal approach to register usage and tries to map guest and host registers using a register allocation scheme. This is something I have to look into further.

But back to _my_ problem in designing an implementation. Let’s look at the x86 instruction    add %eax,10(%ebx,%ecx,4)

If our IR consists more of micro operations that reasonably faithfully translate the original instruction and not using the temporaries tricks that QEMU uses, a translation could result in something like this


mov %ecx,%temp1
mul 4, %temp1
add %ebx,%temp1
add 10,%temp1
LoadMem %temp1,%temp2
add %eax,%temp2
StoreMem %temp2,%temp1

All that work just to simplify an address mode!

As you can see, a problem is that we now require an extra 2 registers!  If we went about translating that IR back to x86, we would not be able to statically map all the registers without spillage.  I suppose we could always reserve 2 or so registers in the host, which would be alot easier than a register allocation algorithm..

The above is still a simplification.. ReadMem and StoreMem are most likely going to be functions in our software MMU implementation.. so this could result in the following –>


mov %ecx,%temp1
mul 4, %temp1
add %ebx,%temp1
add 10,%temp1
push %eax; // ?? eax is the return value of LoadMem, AND a guest/host register
push %temp1
call LoadMem
mov %eax,%temp2
pop %eax
add %eax,%temp2
push %temp2
push %temp1
call StoreMem

What about the return value of LoadMem? This uses up _another_ register! That’s 3 registers. Maybe another solution is to have LoadMem always store the memory contents not as a register, but in memory. This saves us a register, but may or may not be slower or faster.

I also had a look at the http://www.zynamics.com IR which they call REIL (reverse engineering intermediate language). An interesting aspect is they have only one conditional instruction, set if boolean zero. I guess they went to some work to change the eflags (or other status flags/condition codes for non x86) evaluation to be compatible with that. They clearly went for the simpler IR is the best approach.

OK. I’ve been rambling for long enough.. I highly doubt anyone would find this interesting besides me.

→ 2 CommentsCategories: Uncategorized

Finished an implementation of FProt hybrid unpacker paper

March 15, 2009 · Leave a Comment

I’ve been working on implementing that FProt paper I talked about in my last post.  If you missed it – http://www.datasecurity-event.com/uploads/hybrid_approach.pdf.  The basic idea is to scan the binary for known code chunks, and instead of generic emulation,  implementing special handlers (or mini-emulators as the FProt paper describes them).

I implemented the Multi-Pattern Matching algorithm according to a paper by Wu and Manber http://webglimpse.net/pubs/TR94-17.pdf.  Snort once used a variation of this algorithm known as “mwm” – modified Wu Manber.  My implementation of the paper differs slightly from the original paper, and although I won’t outright say some details in the paper are wrong, I had to use my own interpretation of the algorithm for it all to work.

The algorithm is similar to the Boyer-Moore string matching algorithm http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm.  The Wu-Manber algorithm uses a hash of the last characters of the buffer to match to give a list of possible matches.  The bad character shift will be a minimum of shifts for each possible match given by the hash.  If that’s confusing, then look at how the Boyer Moore algorithm works and it will make more sense.

The results of the pattern matching were quite good, with a very minor performance impact, which allowed me to continue implementing the rest of the FProt paper without too many worries about worst case performance if no known code chunks are found.

From an instruction trace of my sample packed binaries, I could fairly easily see where most of the time in execution was being spent.  Basically I looked for code that was writing to memory where the decompressed binary would end up in.  These would loop and be part of the decompression functions.  I then looked for entry points (the code leading up to entering the loop) into that code and copied the complete decompression code into a flat text file.

I think the process of automatic extraction of decompression loops and checksum functions can potentially be automated.  Certainly its fairly easy to see the memory access of a decompression loop by looking at the range of memory writes that occur from that location.  Complete automation is something I will look into the future.  It would be really nice to see automatically generated static unpackers, and I really think such a thing is possible.

I then decompiled the asm decompression code, and ported it to my emulator.  This results in something resembling a static unpacker or mini-emulator.  An interesting thing to note, is that many packers share similar code to handle decompression.  aplib http://www.ibsensoftware.com/products_aPLib.html and minor variations of it are used in a number of packers I have.  Infact, aplib is used in mew/fsg/rlpack/packman and probably others.  It would have been nice to rip the code from the aplib sources, however, it’s not public domain so I decided to reverse it from the assembly in the packed binary.  This is also advantageous because I couldn’t get perfect matches for all the packed samples I had against the aplib versions I had.

Alot of debugging followed, and for the one binary I was targeting (rlpack), I had to include another chunk of code for calculating a checksum of the libraries to get a good performance increase.  But in the end, it all worked out quite well.

The result for rlpack was a performance increase of over 50% over complete generic emulation.  Unpacking rlpack-calc.exe took over .52 seconds using generic emulation, and with the mini-emulator code, it took about .25 seconds.  And remember that in the worst case and no code is identified, generic emulation occurs as usual.

There is still lots to do and many mini-emulators to write.  The question if dynamic binary translation is a completely superior approach is quite open to me.  I think using known code chunks and mini-emulators is faster, but at the price of invested development time.

→ Leave a CommentCategories: Uncategorized