Follow me on Twitter



I haven’t blogged for a few months. I’ve been busy finishing a prototype malware classification system based on flowgraph similarity. That has resulted in submitting a paper to the 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010) The system I developed and discussed in that submission is not fast enough for realtime use in desktop and EMail gateway AntiVirus.  To remedy that, I’ve also been working on a simpler flowgraph based classification system.  It detects less malware variants but performs in near realtime.  I’ve finished a basic prototype and hope to write up my results and submit to an ACM conference by the end of September.  I will write up more details about both systems after publication, which will be at earliest in January 2010.

Removing semantic NOP’s from Malware

A common obfuscation technique used by malware is by randomly inserting a sequence of instructions that have no other effect on the functionality of the program.  This technique is additionally used in polymorphic and metamorphic malware.

There are a couple of interesting papers that try to address this by identifying some classes of “semantic nops” in the malware and removing them.  The basic model is to identify a particular sequence or set of instructions to examine, typically generated by bruteforce of all possible combinations in each basicblock, and then ask an oracle if its a semantic nop.

There are two papers which are probably of interest.  Malware Normalization, is an in interesting paper and has several techniques to canonicalize a binary so that it can be more effective when used scanning for in antivirus.  Another paper of interest is which is a little more formal (and thus for me, harder to understand), on Semantics-Aware Malware Detection.  This paper performs static analysis on the malware and tries to match it to a template signature which describes particular behaviours of the malware such as its memory access.

The semantic nop oracle that can tell you if a sequence of code is a nop or not, is an SMT decision procedure (  SMT is an extension of SAT.  You give it a list of constraints, such as equalities and arithmetic, and then you can query the decision procedure to see if certain conditions can possibly be true or false.  Counter examples can be given in return to prove the results of the decision procedure.

To determine if a sequence of instructions is a semantic NOP, you ask the decision procedure if the set of registers that are modified by the instructions, result in the same values as what they were before hand.  If they are the same, then it’s clearly a NOP.  You also need to check that all memory written to is the same before and after as well.  Care should be taken with the status flags, and the papers mentioned don’t talk in depth about this. Perhaps checking for a lack of liveness in the flags or that the flags are the preserved at the end of a potential nop sequence. All these types of checks are only applied within basicblocks.  Detecting semantic nops as code that branches seems a much harder problem.

This appears to fine in most cases, but what those papers I mentioned earlier fail to describe is the case of exceptions being generated from memory access.  Consider the following code, which for all purposes appears to be a semantic NOP.

mov temp_mem, %eax; # potential exception
inc %eax
dec %eax
mov %eax,temp_mem

Note: status flag are not live at the end of the above code sequence making it eligable as a nop, which means there is a subsequent operation that overwrites the flags before using them.

That code indeed is a semantic NOP, _except_ when an exception occurs from accessing temp_mem.  If an exception occurs, then eax is clearly different from its starting condition when control is transferred to the exception handler.  This opens up a can of worms I think.

If we only look for semantic nops between memory accesses, we will undoubtedly miss a large number of cases that really are nops.  But if we ignore the potential for exceptions, then removing those misidentified nops will result in an unsound transformation.

Another problem was suggested to me, in terms of exceptions resulting from hardware break points. Removing a sequence of code, or applying a transformation to it, may not be sound.

PS.  I implemented a basic and incomplete version of dynamic taint analysis, and am using it to track the return value of GetProcAddress, so I can reconstruct the import table after auto unpacking.  It works fairly nicely.  And I implemented most of the code to transform my IR into STP constraints (STP is an opensource SMT decision procedure).

Interval Analysis and Register Tracking (Taint Analysis)

It’s been a busy day. describes several methods of finding bugs using static analysis.  It talks at one point about path insensitive interval analysis.  Interval analysis tries to discover the potential range or interval(s) a particular variable might have.  Path insensitive means that the interval it generates for a particular variable is quite conservative and contains values that would otherwise not be included if specific execution paths are taken into account.

Register Tracking is what Zynamics calls what appears to be taint analysis. talks about it.  It starts by tracking a particular register, and all other registers that are influenced by it.  I think this would be the basis for a (forward) program slice, but my knowledge of program slicing is pretty limited.  Slicing is all about finding instructions that are dependant or resultant of a particular source instruction.

The Zynamics slides above also talks about MonoREIL, which implements a dataflow analysis framework.  The basis for most dataflow analysis is founded around such formal constructs as monotonic functions and partially ordered lattices.  The original dataflow analysis model, is that each particular node has a set of dataflow equations, and the analysis tries to arrive at a solution for the equations.  To arrive at a solution, there are specific restrictions on the types of equations that can be analysed.  They need to be monotonic and they should be finite (excuse my simple and probably incorrect explanations).  The solution is derived typically through an iterative fixed-point algorithm.  The fixed point part is basically saying that after a particular number of iterations of analysis and applying the dataflow equations, the result stops changing, and this particular fixed point is the solution you want.  The monotonicity and the finite ceiling means that a fixed point solution always occurs.

I’m not the strongest in the theory, but none the less, as I posted sometime last week or so, I implemented a dataflow analysis framework on top of my IR (intermediate representation).  I used this framework to implement classic dataflow analysis such as reaching definitions and live variable analysis.  Today, I implemented interval analysis and register tracking.  I’m not convinced the interval analysis is working 100% so I’ll have to spend some more time looking over the results.  Interval analysis is not well suited to many of my IR instructions such as bitwise arithmetic (logical, shifts and rotates), but it is still worth applying in any case.

Zynamics also describes what they call “Range Tracking” to find negatively indexed arrays.  This is something I haven’t grasped yet but I’ll certainly go over it again and try an implementation in the future.

A rough implementation of the IR flowgraphs in the auto unpacker

Title says most of it.  I have now integrated into my emulator the IR component of the static analysis I will require to perform the control flow classification in my Masters research.  To help my debug, I generated some graphviz output of the callgraph and control flow graphs.  The following is a control flow graph from one of the procedures in the callgraph after auto unpacking to the binary hostname.exe which was packed by UPX .

The IR implementation is still somewhat buggy.  It doesn’t generate all that is required for the status flags handling in x86, and I also have quickly noticed that I have some inconsistency in how i use the StoreMem instruction.  But none the less, it should give an idea of what I’m doing.

I generate the callgraph and control flow graph based on the IR.  This differs to some implementations (like ERESI I believe), which build these graphs from the native assembly.  There is some redundancy between the disassembler and building the graphs from the IR, but this system allows me to use linear sweep disassembly when I need too.  I currently use linear sweep disassembly to disassemble any areas that are not covered by recursive traversal.  This technique is referred to as speculative disassembly.

A rough implementation of the instruction selection done

The topic says it all.   Basically I take the intermediate representation (IR) and translate this back into x86 assembly.  The IR I use is tree based.  I translate these trees representing expressions into x86 assembly, taking care to allocate registers and avoiding too many redundant copies.

The instruction selection works by a process of greedy tree matching.  The top of the tree is matched against a list of patterns.  The biggest patterns are matched first.  The top of the tree containing the pattern is then chopped off and an x86 instruction emitted and the subtrees are then recursively matched.  In actual fact,  generally the subtrees are recursed before emitting an instruction.   Registers can be allocated either from the leaves of the tree where a an expression is calculated with registers or immediates as operands.  Or register allocation can occur more to the top of the tree.  This is still a work in progress to generate more optimal code.

I still however assume an unlimited number of registers, so next to implement is register allocation.  I intend to use the linear scan algorithm which is supposedly a fast register allocation algorithm especially useful for dynamic binary translators and JIT compilers.  Other algorithms for register allocation use a graph colouring approach, but these can reportedly be quite slow.  After register allocation comes the assembling of x86 instructions into machine code.

There are still lots of problems to fix.  My code doesn’t work very nicely still with non sequential code (ie, code with more than one basic block).  Also, I can still generate IR sequences that result in semantic NOPS..

Incidentally, I have enough of the x86 -> IR code done so that it can now process all the instructions executed by the UPX packer.  But I still haven’t finished the status flags handling, so it’s not quite complete.

The woes of Get and Put

When writing an interpreter or a dynamic binary translator, it’s typical to maintain in memory a CPU context.  This is some memory set aside to store the CPU state such as register contents, the program counter and so forth.  In an interpreter, since you are only handling a single instruction at a time, you do all changes to the guest CPU registers directly to that CPU context stored in memory.  In dynamic binary translation, there is an opportunity to use real registers in the translated blocks.  But typically, at the end and beginning of the block, you transfer from/to the stored CPU context into those real registers.

But how does that transfer of registers to the CPU context actually work?

Valgrind uses a nice idea of having specific IR instructions set aside to do this transfer.  Get and Put.  You specifiy an offset in the CPU context, the size, and an IR register.  Then the CPU context contents are transferred.

A question is, should Get/Put be manually generated in the native code to IR translation?  But first, what are all the places Get/Put should be used?

At the beginning of a code fragment, registers need to be loaded.  So this is the main place to use Get.  At the end of the code fragment, or rather, at every exit point in the fragment, modified registers need to be Put back.

That is quite reasonable, but more annoying are places where control can be indirectly transferred.  Take memory loads and stores.  These operations, handled by callbacks in the translated code when using a software MMU implementation, can cause exceptions in the guest.  This means, that a load and store of memory is a point of exit for the code fragment.  Thus, it seems we need to Put the registers back before hand.

Puting back registers before every memory access is really annoying and seems likely to cause a bit of bloat in the translated fragments.  I don’t have a nicer solution however.  Valgrind doesn’t seem to have this problem, and I’m not quite sure how it avoids it.  More investigation is required..

So back to automatically generating Get and Put statements in the IR.  This can be done.  A simple but slightly incorrect solution is to Get all registers at the entry point that are used by the rest of code fragment.  Another simple solution is to Put all registers that are defined (written to) by the code fragment at each exit point. 

A problem exists that you might Put a register before its been defined, depending on the order of instructions.  I imagine the correct solution to this is only Put registers that are in the set of reaching definitions.

I didn’t implement the reaching definitions solution.  I used the more conservative approach of Getting the set of all registers that are used, AND defined.  That is faster to generate, but produces worse IR.

Incidentally, another problem for memory access is the saving of the lazy eflags registers.  These also need to be saved before a potential exception, else the SEH handler might have the wrong value in the SEH context.

In summary, alot of extra headaches are caused by potential exceptions in memory accesses.   Maybe there is a better solution, but it might require a more hackish approach than the one I’ve described.