Monthly Archives: February 2009

Emulator performace improvements using ‘threaded interpretation’

The entropy analysis code added a moderate performance penalty in general, though some in same cases the performance decreased significantly.  If you look at acprotect, there are about  90 places where written memory gets executed.  Entropy analysis gets performed at each one of those places.  I made some improvements to the entropy analysis code by using lookup tables, and its clear how to turn it into using a sliding window now, but I don’t think this is necessary in my use of it.  The lookup tables improved performance by 50% in worst case scenarios.

A good review of emulation techniques can be found here  It talks in one part about ‘threaded interpretation’.  The examples they use assume a very simplified instruction format.  The instruction can be completely decoded by looking at 1 byte which translates directly to an instruction, but the theory is still usable with more complex instructions encodings.

In RISC architectures, where a word tends to contain the opcode in a well defined region of bits, and the rest are operands, decoding is fairly easy.  In x86 and other CISC architectures, disassembling and decoding is much more complicated.  Infact, in my emulator, disassembly took up almost 40% of the CPU time.  Admitidly, the disassembler I use is not very fast, but reducing the amount of decoding that gets done is certainly a good thing.

I now maintain a cache of decoded/disassembled instructions.  Instead of going to the disassembler at every change of the program counter, I now perform a cache lookup.  All is good if the decoded instruction is in the cache, but if its then it needs to be inserted after performing the instruction decoding.  Much of the time, code gets called multiple times in loops and functions etc, so generally you expect a lot of cache hits.  To cope with self modifying code, writing to memory invalidates that part of the cache.  This all falls under the umbrella of threaded interpretation (if I’m getting my terminology write).

I use c+ STL sets to manage the cache.  Using arrays could be faster, but at the expense of more memory.  I will have to consider whether to try an implementation like this in the future.

Another performance improvement was made by speculating the result of a cache lookup.  For every cache hit, I speculate the next cache lookup will be the next ordered entry in the cache.  This makes sense when you consider that most execution occurs one instruction after the next.  At branch instructions etc, this speculation will fail.

The performance improvements were quite marked, probably towards a 40% speed improvement.  A month ago, it would have taken 20 seconds to unpack calc.exe packed with rlpack.   Now it takes a touch over 1 second, with additional processing to check for the OEP using entropy analysis.  I still don’t think its fast enough for on-access use in AV software, but its certainly getting faster.

PS.  I had this idea yesterday, that on multi-core machines, it would make sense to have a spare core disassembling the code ahead of time (and invalidating it if it turns out to be wrong due to self modifying code or disassembling the middle of instructions etc).  There may be some lock contention in managing the cache of decoded instructions, but if that turns out ok, it would mean alot of work could be done for free.  If I am up too it, I might try an implementation at some point.

[UPDATE] I tried implementing the cache using arrays instead of c++ set’s, and noticed another fairly significant speed improvement.  The array version was about 20-25% faster than the STL version, at the expense of using more memory.  That means that the total improvement ,with a cache compared to without a cache, is over 50%.

Entropy Analysis results take 2) Finding the OEP in Multi-Stage Packers

I deleted this post as I am using a similar mechanism in a research paper which will be published at a later date.

Entropy Analysis – Identifying packed binaries, and finding the Original Entry Point

I deleted this post as I am using a similar mechanism in a research paper which will be published at a later date.

Anti-debugging prefetch tricks and single stepping through a rep stos/movs

I was running my emulator against some malware, and came across one that was packed with an early version of telock (0.7x iirc).  My emulator didn’t handle it nicely, and neither did my tracer.  The problem was a rep stosb instruction which was overwriting itself (the target of the stosb was the address of the instruction).   Remember that it takes many steps in a rep, depending on ecx and weather fast string operations are being used on your cpu.  So single stepping through the rep, my emulator was decoding the instruction at that address after each step.  In the real world, the instruction is cached in a Prefetch Input Queue  Even though that wiki is a bit old and most of the prefetch tricks dont work anymore, I’ve read that rep stos and rep movs still use the prefetch.  What this means, is that the rep stos instruction in that malware I was looking at is cached, and even if the instruction is overwritten in memory, the prefetch’d instruction is still used for execution.  Flushing the PIQ occurs in a number of places, including during single stepping.  So when single stepping that malware with the prefetch trick, it behaves differently than if it were running normally.  In my case, the malware was crashing.

It was a fairly straight forward task to modify my emulator to handle this correctly, but a different story in making my tracer work around it.  I really needed to single step for the most part, but could cope with losing stepping through a rep stos/movs.  I thought of a few ideas, including setting a breakpoint after the instruction to simulate a single step.  In the end I came up with a different approach.

I single step as per usual, but when I come across a rep movs/stos, I handle it specifically.  I copy the memory at eip for the instruction length – this is what is in the prefetch.

I still step through the rep movs/stos, but before executing it, I take note of %edi.  If it points to the address of the instruction, I copy the contents of that memory location after the step, as this will be the real contents of memory after the prefetch.

I then restore the memory that was modified to its original data (what is in the prefetch).   This allows the instruction to continue as if the prefetch and the the target memory were exactly the same.

At the end of the rep, I replace the memory with the modified version I’ve been building.

Dunno how good of a description that is.. My English skills were never the best.  In any case, this works fairly well.  I have some concerns if a rep movs occurs with a source and destination both pointing to parts of the current instruction.. But I don’t think any malware is using this form of rep movs, so I’ll not spend any more time on it. 

And yes, I did manage to unpack telock with my emulator once this was working – and also verify it with my tracer 🙂