Emulator performance improvements using ‘threaded interpretation’

The entropy analysis code added a moderate performance penalty in general, though in some cases performance decreased significantly.  If you look at acprotect, there are about 90 places where written memory gets executed, and entropy analysis gets performed at each one of those places.  I made some improvements to the entropy analysis code by using lookup tables, and it’s clear how to turn it into a sliding window now, but I don’t think this is necessary for my use of it.  The lookup tables improved performance by 50% in worst-case scenarios.
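
For the curious, the trick looks roughly like this (a minimal sketch, not my actual code; the window size and names are illustrative).  The idea is to precompute the -p*log2(p) term for every possible byte count, so the per-block pass is just counting plus table lookups:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlock = 4096;  // analysis window size (my assumption)

// table[c] holds -(c/N)*log2(c/N) for a byte value occurring c times in a
// block of N bytes, so the hot path never calls log2().
static std::array<double, kBlock + 1> MakeTable() {
    std::array<double, kBlock + 1> t{};
    for (std::size_t c = 1; c <= kBlock; ++c) {
        const double p = static_cast<double>(c) / kBlock;
        t[c] = -p * std::log2(p);
    }
    return t;  // t[0] stays 0: a value that never occurs contributes nothing
}

// Shannon entropy of one block, in bits per byte (0 = constant, 8 = random).
double BlockEntropy(const uint8_t* data) {
    static const std::array<double, kBlock + 1> table = MakeTable();
    std::size_t counts[256] = {0};
    for (std::size_t i = 0; i < kBlock; ++i)
        ++counts[data[i]];
    double h = 0.0;
    for (int v = 0; v < 256; ++v)
        h += table[counts[v]];  // one lookup per byte value, no log2() calls
    return h;
}
```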

A good review of emulation techniques can be found here: http://apple1.chez.com/Apple1project/Docs/pdf/Study_techniques_for_emulation_programming.pdf.  One part of it discusses ‘threaded interpretation’.  The examples it uses assume a very simplified instruction format: the instruction can be completely decoded by looking at one byte, which translates directly to an instruction.  The theory is still usable with more complex instruction encodings, though.
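
To illustrate that simplified model, here is a toy sketch (the mini-ISA and names are my own invention, not from the paper): decoding is a single table index from the opcode byte to a handler, so fetch, decode and dispatch collapse into one step.

```cpp
#include <cstdint>
#include <cstdio>

struct VM {
    uint32_t acc = 0;              // single accumulator
    const uint8_t* pc = nullptr;   // program counter
    bool running = true;
};

using Handler = void (*)(VM&);

void op_halt(VM& vm) { vm.running = false; }
void op_inc (VM& vm) { vm.acc += 1; }
void op_addi(VM& vm) { vm.acc += *vm.pc++; }  // consumes one immediate byte

// The entire "decode" step is one table index: opcode byte -> handler.
// Only opcodes 0x00-0x02 are defined; the rest stay null in this toy.
Handler dispatch[256] = { op_halt, op_inc, op_addi };

void Run(VM& vm, const uint8_t* code) {
    vm.pc = code;
    while (vm.running)
        dispatch[*vm.pc++](vm);   // fetch + decode + dispatch in one step
}

int main() {
    const uint8_t program[] = { 0x01, 0x02, 0x05, 0x00 };  // inc; addi 5; halt
    VM vm;
    Run(vm, program);
    std::printf("acc = %u\n", vm.acc);  // prints acc = 6
}
```

With x86 there is no such luck: the opcode byte alone doesn’t determine the instruction or its length, which is exactly why the caching described below matters.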

In RISC architectures, where a word tends to contain the opcode in a well-defined region of bits and the rest is operands, decoding is fairly easy.  In x86 and other CISC architectures, disassembling and decoding is much more complicated.  In fact, in my emulator, disassembly took up almost 40% of the CPU time.  Admittedly, the disassembler I use is not very fast, but reducing the amount of decoding that gets done is certainly a good thing.

I now maintain a cache of decoded/disassembled instructions.  Instead of going to the disassembler at every change of the program counter, I now perform a cache lookup.  All is good if the decoded instruction is in the cache; if it’s not, it needs to be inserted after performing the instruction decoding.  Much of the time, code gets called multiple times in loops and functions etc., so generally you expect a lot of cache hits.  To cope with self-modifying code, writing to memory invalidates that part of the cache.  This all falls under the umbrella of threaded interpretation (if I’m getting my terminology right).
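
In rough outline, the cache works something like this (a simplified sketch with stand-in Decode/Execute helpers, not the real code):

```cpp
#include <cstdint>
#include <map>

// Hypothetical decoded-instruction record; the real thing would carry the
// opcode, operands and a handler pointer from the disassembler.
struct DecodedInsn {
    uint32_t length = 1;  // encoded length in bytes
};

// Stand-ins for the real disassembler and interpreter steps.
DecodedInsn Decode(uint32_t addr) { (void)addr; return DecodedInsn{}; }
void Execute(const DecodedInsn&) { /* interpret the instruction */ }

// Ordered container, so entries sort by address (this also enables the
// "next ordered entry" speculation described further down).
static std::map<uint32_t, DecodedInsn> cache;

void Step(uint32_t& pc) {
    auto it = cache.find(pc);
    if (it == cache.end())                          // miss: decode once...
        it = cache.emplace(pc, Decode(pc)).first;   // ...reuse on every revisit
    Execute(it->second);
    pc += it->second.length;
}

// Self-modifying code: a write invalidates any cached decoding whose bytes
// could overlap the written range.
void OnMemoryWrite(uint32_t addr, uint32_t size) {
    const uint32_t kMaxInsnLen = 15;  // x86 maximum instruction length
    const uint32_t lo = addr > kMaxInsnLen - 1 ? addr - (kMaxInsnLen - 1) : 0;
    cache.erase(cache.lower_bound(lo), cache.lower_bound(addr + size));
}
```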

I use C++ STL sets to manage the cache.  Using arrays could be faster, but at the expense of more memory.  I will have to consider whether to try an implementation like that in the future.

Another performance improvement was made by speculating on the result of a cache lookup.  For every cache hit, I speculate that the next lookup will be the next ordered entry in the cache.  This makes sense when you consider that most execution proceeds one instruction after the next.  At branch instructions etc., this speculation will fail.
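
Continuing the sketch above, the speculation amounts to remembering the iterator of the last hit and testing its successor before paying for a full tree search:

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Assumes the `cache` map and DecodedInsn struct from the earlier sketch.
static std::map<uint32_t, DecodedInsn>::iterator last_hit = cache.end();

std::map<uint32_t, DecodedInsn>::iterator Lookup(uint32_t pc) {
    if (last_hit != cache.end()) {
        auto next = std::next(last_hit);     // speculate: straight-line code
        if (next != cache.end() && next->first == pc)
            return last_hit = next;          // hit: O(1), no tree search
    }
    auto it = cache.find(pc);                // first lookup, or a branch broke
    if (it != cache.end())                   // the speculation
        last_hit = it;
    return it;
}

// Note: OnMemoryWrite() must also reset last_hit to cache.end() whenever it
// might erase the remembered entry, since map::erase invalidates that
// iterator.
```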

The performance improvements were quite marked, probably approaching a 40% speed improvement.  A month ago, it would have taken 20 seconds to unpack calc.exe packed with rlpack.  Now it takes a touch over 1 second, with additional processing to check for the OEP using entropy analysis.  I still don’t think it’s fast enough for on-access use in AV software, but it’s certainly getting faster.

PS.  I had this idea yesterday that, on multi-core machines, it would make sense to have a spare core disassembling the code ahead of time (and invalidating it if it turns out to be wrong due to self-modifying code or disassembling the middle of instructions etc.).  There may be some lock contention in managing the cache of decoded instructions, but if that turns out OK, it would mean a lot of work could be done for free.  If I am up to it, I might try an implementation at some point.
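
If I do try it, the shape would be something like the following (a very rough, untested sketch reusing the map cache and Decode() stand-in from above):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>

// A worker walks ahead of the current pc and fills the shared cache under a
// mutex.  If it guesses wrong (self-modifying code, or decoding from the
// middle of an instruction), the normal invalidation and re-decode paths
// correct it.  Whether lock contention eats the benefit is the open question.
std::mutex cache_mutex;
std::atomic<uint32_t> shared_pc{0};   // published by the interpreter thread
std::atomic<bool> stop_worker{false};

void PredecodeWorker() {
    while (!stop_worker.load()) {
        uint32_t ahead = shared_pc.load();
        for (int i = 0; i < 64 && !stop_worker.load(); ++i) {
            DecodedInsn d = Decode(ahead);   // speculative: may be garbage
            {
                std::lock_guard<std::mutex> lock(cache_mutex);
                cache.emplace(ahead, d);     // no-op if already decoded
            }
            ahead += d.length;
        }
    }
}
```

The interpreter side would of course have to take the same lock around its own lookups and invalidations, which is exactly where the contention worry comes in.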

[UPDATE] I tried implementing the cache using arrays instead of C++ sets, and noticed another fairly significant speed improvement.  The array version was about 20-25% faster than the STL version, at the expense of using more memory.  That means the total improvement with a cache, compared to without one, is over 50%.
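
The array version is essentially a direct-mapped cache; a sketch (sizes and the tag scheme are illustrative, and DecodedInsn is the struct from the earlier sketch):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct-mapped table indexed by the low bits of the pc.  Lookup is O(1)
// with no tree walk, but the table costs memory whether it is full or not.
struct Slot {
    uint32_t    tag = 0;      // full guest address of the cached decoding
    bool        valid = false;
    DecodedInsn insn;
};

constexpr std::size_t kSlots = 1 << 16;          // 64K entries, power of two
static std::vector<Slot> table(kSlots);

DecodedInsn* ArrayLookup(uint32_t pc) {
    Slot& s = table[pc & (kSlots - 1)];          // direct-mapped index
    return (s.valid && s.tag == pc) ? &s.insn : nullptr;
}

void ArrayInsert(uint32_t pc, const DecodedInsn& insn) {
    Slot& s = table[pc & (kSlots - 1)];          // evicts any previous tenant
    s.tag = pc;
    s.valid = true;
    s.insn = insn;
}
```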

One response to “Emulator performance improvements using ‘threaded interpretation’”

  1. you might want to have a look at Google Native Client’s built-in disassembler. They use “table-driven” disassembly and it seems pretty efficient.
