Monthly Archives: March 2009

My research application is accepted, and I’ve accepted the letter of offer

The topic says it all.  I changed the start date of the research from 1st May to 31st March (today) at the Universities request.  Clearly they want to increase their admission numbers for this term.  The degree is expected to be completed in 1.5 years.  Maximum time is 2 years and minimum 1 year.

My Research Application

I submitted my research application to the University ( today.  It is for a Masters (in Informatics) degree.  The research topic I proposed is “Fast Automated Malware Unpacking and Classification”.  This will be based on the emulator and unpacker I’ve been working on and blogging about.  Things I will do for the research that I haven’t done yet, is to implement dynamic binary translation, and use the structural classification based on control flow used/documented by Zynamics/Halvar Flake etc which they say is not for the desktop, and make it usable for realtime desktop on-access applications.  I will also integratethe ClamAV signature database.  Its basically going to be a fast anti-malware system.  I’m not sure if I will make public the source code, but I have talked to my (to be) supervisor, and he is fine with it if I choose to do that.

What appears to be a QEMU segment limit bug, and some DBT implementation thoughts

[Edit:  I received an Email from Otto Ebeling who told me that this is a known and documented limitation of QEMU.  But practically no OS would require these segment limit or access checks in any case.]

First up, an unverified QEMU bug handling segmented memory.  From looking quickly at the code QEMU has special handling for checking the segment limit in places say like, executing code in the code segment.  In other places such as what happens during a movs instruction, it calculates a linear address by adding the segment base to the address/offset held in esi or edi.  It doesn’t check the segment limit however (or even check for an integer overflow in adding the segment base for that matter).  This to me seems like a bug.  If you set up 2 segments, it could be possible to reference the 2nd segment from the 1st by overflowing the 1st segments limit.

Fixing this bug seems to be quite tricky.  The correct behaviour I think is to generate an access violation/exception when trying to dereference a segment:offset.  But QEMU generates translated code that calculates the linear address and holds it in its cache.  After calculating the address, at some point the software MMU is called using the linear address, and this is when the exception should be raised.

I had this same bug in my interpreted emulator although I knew of the problem when I first wrote the code, but then forgot about it.  The fix was pretty straight forward.  To fix the problem in the dynamic binary translation code I’m working on seems a little harder.  The ReadMemory function I use in the emulator actually takes a segment selector as a parameter, so I could simply pass the segment selector instead of trying to calculate a linear address from the segment base.  This is probably the easiest fix, but it means that the IR i’m working on becomes really ugly.  I don’t think its wise to use an IR that has segmentation, and it would be annoying to use such an IR in other projects..  I’m undecided how I’ll try to fix the segment limit problem for now..

I also spent a fair amount of time deciding about the problem of register aliasing.  In x86, al, ax, and eax all reference the same register.  They are register aliases.  I wasn’t sure if my IR should handle everything as 32bit, and not use different size operands.  I decided against this because I would lose some kinds of information.  The type reconstruction in a decompiler would seem to need the original operand sizes, and I wanted my IR to be potentially useful if I tried implementing this in the future.  I have also decided on using IR instructions that extract the lo/hi bytes akin to al/ah etc from a 32bit word, and also extracting the lower 16bits of a 32bit word.  This will make for better code generation for the DBT to have such instructions.  Valgrind actually has similar functions, except to get say ah from eax, it extracts the lower 16 bits first in 1 instruction, then the high byte from that 16bits in another instruction.

I have also currently decided to use registers in my IR instead of a block of memory representing the guest state like Valgrind uses.  Maybe the Valgrind approach is better, but for now I’m using registers.

REIL (the IR used in binnavi/zynamics etc) doesnt seem to have instructions that do the type of casts that I’m talking about above.  I only have the public online REIL documentation, so clearly they must have equivalent but for now I can’t see it documented.

Trying to figure out register allocation in dynamic binary translation

I’ve been trying to figure out what I need to do to implement dynamic binary translation in my emulator.

The most common theme when translating a basic block, is to disassemble each instruction.  From this generate an intermediate representation.  After the basic block is converted to this IR, perform any reasonable optimisations on the new code.  Then when done, generate native code from the IR.

The IR might use a small number of registers, and these registers can be statically mapped to a host.  QEMU follows this approach.  Guest registers are kept in memory.  This is probably the simplest approach to register allocation, but I suspect the native code QEMU generates can be greatly improved by keeping guest registers in real registers on the host and spilling when necessary.

So QEMU when faced with the x86 instruction, inc %ebx; inc %ebx, it generates IR similar to the following.

mov &context->ebx,%TEMP1
inc %TEMP1
mov %TEMP1, &context->ebx
mov &context->ebx,%TEMP1
inc %TEMP1
mov %TEMP1, &context->ebx

That’s not so bad but when you consider that most instructions in the basic block access the same register, it would be really nice if the IR generated code like this –>

mov &context->ebx,%ebx;
inc %ebx
inc %ebx
mov %ebx, &context->ebx

Clearly, there are advantages to using the QEMU approach, as register allocation problems are practically eliminated and at most only those temporaries defined by the IR are used in each host.

I haven’t looked in great depth at Valgrind, but it seems to use the more optimal approach to register usage and tries to map guest and host registers using a register allocation scheme. This is something I have to look into further.

But back to _my_ problem in designing an implementation. Let’s look at the x86 instruction    add %eax,10(%ebx,%ecx,4)

If our IR consists more of micro operations that reasonably faithfully translate the original instruction and not using the temporaries tricks that QEMU uses, a translation could result in something like this

mov %ecx,%temp1
mul 4, %temp1
add %ebx,%temp1
add 10,%temp1
LoadMem %temp1,%temp2
add %eax,%temp2
StoreMem %temp2,%temp1

All that work just to simplify an address mode!

As you can see, a problem is that we now require an extra 2 registers!  If we went about translating that IR back to x86, we would not be able to statically map all the registers without spillage.  I suppose we could always reserve 2 or so registers in the host, which would be alot easier than a register allocation algorithm..

The above is still a simplification.. ReadMem and StoreMem are most likely going to be functions in our software MMU implementation.. so this could result in the following –>

mov %ecx,%temp1
mul 4, %temp1
add %ebx,%temp1
add 10,%temp1
push %eax; // ?? eax is the return value of LoadMem, AND a guest/host register
push %temp1
call LoadMem
mov %eax,%temp2
pop %eax
add %eax,%temp2
push %temp2
push %temp1
call StoreMem

What about the return value of LoadMem? This uses up _another_ register! That’s 3 registers. Maybe another solution is to have LoadMem always store the memory contents not as a register, but in memory. This saves us a register, but may or may not be slower or faster.

I also had a look at the IR which they call REIL (reverse engineering intermediate language). An interesting aspect is they have only one conditional instruction, set if boolean zero. I guess they went to some work to change the eflags (or other status flags/condition codes for non x86) evaluation to be compatible with that. They clearly went for the simpler IR is the best approach.

OK. I’ve been rambling for long enough.. I highly doubt anyone would find this interesting besides me.

Finished an implementation of FProt hybrid unpacker paper

I’ve been working on implementing that FProt paper I talked about in my last post.  If you missed it –  The basic idea is to scan the binary for known code chunks, and instead of generic emulation,  implementing special handlers (or mini-emulators as the FProt paper describes them).

I implemented the Multi-Pattern Matching algorithm according to a paper by Wu and Manber  Snort once used a variation of this algorithm known as “mwm” – modified Wu Manber.  My implementation of the paper differs slightly from the original paper, and although I won’t outright say some details in the paper are wrong, I had to use my own interpretation of the algorithm for it all to work.

The algorithm is similar to the Boyer-Moore string matching algorithm  The Wu-Manber algorithm uses a hash of the last characters of the buffer to match to give a list of possible matches.  The bad character shift will be a minimum of shifts for each possible match given by the hash.  If that’s confusing, then look at how the Boyer Moore algorithm works and it will make more sense.

The results of the pattern matching were quite good, with a very minor performance impact, which allowed me to continue implementing the rest of the FProt paper without too many worries about worst case performance if no known code chunks are found.

From an instruction trace of my sample packed binaries, I could fairly easily see where most of the time in execution was being spent.  Basically I looked for code that was writing to memory where the decompressed binary would end up in.  These would loop and be part of the decompression functions.  I then looked for entry points (the code leading up to entering the loop) into that code and copied the complete decompression code into a flat text file.

I think the process of automatic extraction of decompression loops and checksum functions can potentially be automated.  Certainly its fairly easy to see the memory access of a decompression loop by looking at the range of memory writes that occur from that location.  Complete automation is something I will look into the future.  It would be really nice to see automatically generated static unpackers, and I really think such a thing is possible.

I then decompiled the asm decompression code, and ported it to my emulator.  This results in something resembling a static unpacker or mini-emulator.  An interesting thing to note, is that many packers share similar code to handle decompression.  aplib and minor variations of it are used in a number of packers I have.  Infact, aplib is used in mew/fsg/rlpack/packman and probably others.  It would have been nice to rip the code from the aplib sources, however, it’s not public domain so I decided to reverse it from the assembly in the packed binary.  This is also advantageous because I couldn’t get perfect matches for all the packed samples I had against the aplib versions I had.

Alot of debugging followed, and for the one binary I was targeting (rlpack), I had to include another chunk of code for calculating a checksum of the libraries to get a good performance increase.  But in the end, it all worked out quite well.

The result for rlpack was a performance increase of over 50% over complete generic emulation.  Unpacking rlpack-calc.exe took over .52 seconds using generic emulation, and with the mini-emulator code, it took about .25 seconds.  And remember that in the worst case and no code is identified, generic emulation occurs as usual.

There is still lots to do and many mini-emulators to write.  The question if dynamic binary translation is a completely superior approach is quite open to me.  I think using known code chunks and mini-emulators is faster, but at the price of invested development time.

Optimising away single stepping, and an interesting FProt paper

I had a look at how Bochs handles detection of the trap flag to indicate single stepping, and in general what happens is that before each instruction is executed, any exceptions held in a queue are processed.

Originally in my emulator’s main loop, I was checking for conditions such as hardware breakpoints and also if single stepping was enabled.  If it was, I’d throw an exception once the instruction was complete.

I was able to remove the explicit check for the hardware breakpoints the other day in the main loop and I mention that in another post, but today I was also able to eliminate the single step check.

To enable a single step the eflags register needs to be changed to set the trap flag.  This is done primarily by using the popf instruction which can set the trap flag.  pushf is used to obtain the current eflags contents, and that is or’d with 0x100 (from memory) to set the specific flag in question.   In theory, returning from a win32 exception handler could also set this flag (though I haven’t checked), and also SetThreadContext could set it.

Once the trap flag is set, at the end of the next instruction, an exception is raised, and the trap flag is cleared.  Some malware and packers (like telock) uses these types of exceptions to make reversing more difficult.

What I do now to handle this situation, is to instrument popf so that if the trap flag is set, the following instruction is modified in the instruction cache so that the function that handles its implementation points to a special function that in addition to performing the original function implementing that instruction, also raises an exception once complete.  It also restores the function pointer so as not to call the single step function again.

Its clear the instruction cache (see the threaded interpretation blog post), has become a really useful tool that enables such optimisations to occur.

And that kind of leaves me with not so many places I can further optimise, which brings me to an interesting paper by the FProt anti-virus team and how they implement unpacking in their engine

In essence, they have an emulator which can detect known code sequences in the process image, and when the process tries to execute such a sequence (fprot calls them known code chunks), it can emulate the sequence in one go with its own specific code without generically interpreting instruction by instruction. 

This appears to be a great way to significantly improve the performance of such an emulator, and I’m really interested to try to implement such a thing.  But having already implemented code to find crypto constants in the process image, its clear that a fast string search is necessary.  My current naive algorithm for finding crypto constants is noticeably slow.  So I will have to look at the standard algorithms used by anti-virus and nids.

The FProt paper doesn’t explicitly describe the algorithm it uses, but clearly after program loading, and whenever code is executed that was previously written to (indicating self modifying code), these are the times when searching for those known code chunks should occur.  And to avoid the cost of a map lookup at each EIP to match the known code chunks we found, the instruction cache entry for the beginning of each known code chunk should point to a special function that implements the ‘mini-emulator’ (as FProt calls it) which implements the emulation for that code chunk.  The has no cost associated with it, except for searching for those known code chunks.  It would probably make sense to search for known malware signatures at the same time.. but I’m not sure that I want to write a complete AV solution yet..

This post was probably not that interesting for most people, but that FProt paper is an interesting read none the less.

Lazy EFlags evaluation and other emulator optimisations

I haven’t been writing much code recently, so I don’t have too much to comment on, but in the past couple days I went about optimising the emulator again.  I probably ended up gaining a 10% performance improvement.

The arithmetic instructions in x86 modify the eflags register which contains in it 6 status bits: carry, auxilliary, overflow, sign, parity and zero flags.  An optimisation employed by both QEMU and Bochs is to defer most of the evaluation of the flags to when they are actually used.  This is fairly common as you can imagine a series of arithmetic instructions, and only the last one is useful in terms of the status flags, before they are used in say for example a conditional branch. is an interesting article on implementing eflags evaluation in Bochs.  In my implementation of lazy eflags evaluation, I didn’t employ many cool tricks and optimisations and have only seen a miniscule improvement in performance.  In most arithmetic functions now I make a copy of the source and destination operands, plus the result of the operation, and also the type of operation that occured (eg, ADDByte, ADDWord etc).  I use this operation type to caclulate the flags specific to the last operation that occurred.  Bochs and QEMU do a host of other funky optimisations and they have I believe much higher performance.  I will have to look into this further, as nearly all developers of x86 emulators say that eflags evaluation is a fairly good place for optimisation in an x86 emulator.

I also made some other improvements such as eliminating the prefix specific handling of REP/REPNE/REPE/LOCK in my emulator main loop.  Now they are treated more as instructions in their own right in terms of the dispatcher code.

I also eliminated the check for hardware execution breakpoints at every change of EIP in the main loop.  I now place a pseudo instruction at the location of a hardware breakpoint in my instruction cache which processes the breakpoint only when necessary.

Even with these simplifications of the emulator main loop, it still uses up over 30% of the CPU time.  I am now unpacking calc.exe packed with rlpack on average in under .6 seconds.  About 7.7 million instructions are processed, which means I’m executing more than 10 million instructions per second on my 2.5ghz quad core.

Using Dynamic Binary Translation is far superior to my implementation of interpretation however.  I’ve been thinking about starting work on this for a while.

Oh.. I spoke to my local University about a research project for a Master’s degree on on-demand malware unpacking and structural classification.  They were quite excited with the topic, but whether I start this project is really dependant on me starting (or not starting) work.  I still will try to go to Uni if I work, but I might have to shop around at different Universities.