Lazy EFlags evaluation and other emulator optimisations

I haven’t been writing much code recently, so I don’t have much to comment on, but over the past couple of days I went back to optimising the emulator.  I probably gained about a 10% performance improvement.

The arithmetic instructions in x86 modify the eflags register, which contains 6 status bits: the carry, auxiliary, overflow, sign, parity and zero flags.  An optimisation employed by both QEMU and Bochs is to defer most of the evaluation of the flags until they are actually used.  This pays off because it is common to see a series of arithmetic instructions where only the last one matters for the status flags, which are then consumed by, for example, a conditional branch.  There is an interesting article on implementing eflags evaluation in Bochs.

In my implementation of lazy eflags evaluation, I didn’t employ many cool tricks and optimisations, and have only seen a minuscule improvement in performance.  Most arithmetic functions now make a copy of the source and destination operands, plus the result of the operation, and also record the type of operation that occurred (e.g. ADDByte, ADDWord etc).  I use this operation type to calculate the flags specific to the last operation that occurred.  Bochs and QEMU do a host of other funky optimisations and, I believe, get much higher performance.  I will have to look into this further, as nearly all developers of x86 emulators say that eflags evaluation is a fairly good place for optimisation.
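
To make the idea concrete, here is a minimal sketch of lazy flag evaluation in Python. It is not the emulator's actual code: the names `LazyFlags`, `record` and the operation tags are assumptions made up for illustration (the post only mentions ADDByte and ADDWord), and only ADD and SUB are covered. Each arithmetic operation just saves its operands, result and type; a flag is computed from those saved values only when something asks for it.

```python
from dataclasses import dataclass

# Hypothetical operation tags: (kind, operand width in bits). The post only
# names ADDByte and ADDWord; SUB_BYTE is added here purely for illustration.
ADD_BYTE = ("add", 8)
ADD_WORD = ("add", 16)
SUB_BYTE = ("sub", 8)


@dataclass
class LazyFlags:
    """Remembers the last arithmetic operation instead of computing flags."""
    op: tuple = ADD_BYTE
    dst: int = 0
    src: int = 0
    result: int = 0

    def record(self, op, dst, src):
        """Do the arithmetic and save the operands; no flag work happens here."""
        kind, width = op
        mask = (1 << width) - 1
        dst &= mask
        src &= mask
        raw = dst + src if kind == "add" else dst - src
        self.op, self.dst, self.src, self.result = op, dst, src, raw & mask
        return self.result

    def _sign_bit(self, value):
        # Top bit of a value at the recorded operand width.
        return (value >> (self.op[1] - 1)) & 1

    def carry(self):
        if self.op[0] == "add":
            return int(self.result < self.dst)  # unsigned wrap-around
        return int(self.dst < self.src)         # borrow on subtraction

    def overflow(self):
        if self.op[0] == "add":
            return self._sign_bit((self.dst ^ self.result) & (self.src ^ self.result))
        return self._sign_bit((self.dst ^ self.src) & (self.dst ^ self.result))

    def auxiliary(self):
        # Carry or borrow between bit 3 and bit 4.
        return ((self.dst ^ self.src ^ self.result) >> 4) & 1

    def zero(self):
        return int(self.result == 0)

    def sign(self):
        return self._sign_bit(self.result)

    def parity(self):
        # x86 parity covers the low byte only; set when the bit count is even.
        return int(bin(self.result & 0xFF).count("1") % 2 == 0)


flags = LazyFlags()
flags.record(ADD_BYTE, 0x10, 0x20)  # flags are never read, so never computed
flags.record(ADD_BYTE, 0xFF, 0x01)  # only this operation matters to the branch
take_branch = flags.zero() == 1     # e.g. a JZ immediately afterwards
```

The trade-off in this sketch is a few extra stores per arithmetic instruction in exchange for skipping six flag computations whenever the flags are overwritten before being read.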

I also made some other improvements, such as eliminating the prefix-specific handling of REP/REPNE/REPE/LOCK in my emulator main loop.  The dispatcher code now treats them more like instructions in their own right.
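
As a rough illustration of what treating a prefix as an instruction can look like, the sketch below gives REP its own entry in the same dispatch table as every other opcode, so the main loop needs no prefix branch. Everything here (`Cpu`, `HANDLERS`, `step`, the tiny three-opcode instruction set) is a hypothetical toy and not the emulator's real design; a real REP handler would also have to check that the following opcode is a string instruction.

```python
class Cpu:
    """Toy CPU state: flat memory plus the few registers this sketch needs."""

    def __init__(self, code):
        self.mem = bytearray(64)
        self.mem[:len(code)] = code
        self.eip = 0
        self.ecx = 0
        self.edi = 32
        self.al = 0


def op_nop(cpu):
    pass


def op_stosb(cpu):
    # STOSB: store AL at [EDI] and advance EDI (direction flag assumed clear).
    cpu.mem[cpu.edi] = cpu.al
    cpu.edi += 1


def op_rep(cpu):
    # REP is dispatched like any other opcode: it fetches the instruction
    # that follows it and repeats that handler until ECX reaches zero.
    opcode = cpu.mem[cpu.eip]
    cpu.eip += 1
    body = HANDLERS[opcode]
    while cpu.ecx != 0:
        body(cpu)
        cpu.ecx -= 1


# One table for everything; the main loop has no special case for prefixes.
HANDLERS = {0x90: op_nop, 0xAA: op_stosb, 0xF3: op_rep}


def step(cpu):
    opcode = cpu.mem[cpu.eip]
    cpu.eip += 1
    HANDLERS[opcode](cpu)


cpu = Cpu(bytes([0xF3, 0xAA, 0x90]))  # REP STOSB; NOP
cpu.ecx, cpu.al = 3, 0x41
step(cpu)  # executes the whole REP STOSB
```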

I also eliminated the check for hardware execution breakpoints at every change of EIP in the main loop.  I now place a pseudo-instruction in my instruction cache at the location of each hardware breakpoint, so the breakpoint is processed only when that address is actually executed.
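
One way this could be structured is sketched below, under the assumption that the instruction cache maps an address to a decoded handler. The names `Emulator`, `add_instruction` and `set_breakpoint` are invented for the example. Setting a breakpoint swaps the cached entry for a wrapper that reports the hit and then runs the original instruction, so the main loop itself never compares EIP against a breakpoint list.

```python
class Emulator:
    """Toy emulator whose instruction cache maps address -> handler."""

    def __init__(self):
        self.cache = {}
        self.eip = 0
        self.hits = []   # addresses where a breakpoint fired
        self.trace = []  # names of executed instructions

    def add_instruction(self, addr, name, length):
        # Stand-in for decoding: the handler logs its name and advances EIP.
        def handler(emu):
            emu.trace.append(name)
            emu.eip = addr + length
        self.cache[addr] = handler

    def set_breakpoint(self, addr):
        # Replace the cached entry with a pseudo-instruction that handles
        # the breakpoint and then executes the original instruction.
        original = self.cache[addr]

        def breakpoint_pseudo(emu):
            emu.hits.append(addr)
            original(emu)
        self.cache[addr] = breakpoint_pseudo

    def run(self, steps):
        # No per-step breakpoint check: dispatch only.
        for _ in range(steps):
            self.cache[self.eip](self)


emu = Emulator()
emu.add_instruction(0, "inc", 1)
emu.add_instruction(1, "dec", 1)
emu.add_instruction(2, "nop", 1)
emu.set_breakpoint(1)
emu.run(3)
```

In this sketch, instructions without a breakpoint pay nothing extra; the cost is moved entirely onto the one cache entry that carries the wrapper.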

Even with these simplifications, the emulator main loop still accounts for over 30% of the CPU time.  I am now unpacking calc.exe packed with rlpack in under 0.6 seconds on average.  About 7.7 million instructions are processed, which means I’m executing more than 10 million instructions per second on my 2.5 GHz quad core.
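
For a sense of scale, the figures above can be turned into a rough per-instruction cost. The short calculation below assumes the emulator runs on a single core and takes the full 0.6 seconds, so the rate it gives is a lower bound and the cycle count an upper bound.

```python
instructions = 7.7e6   # emulated instructions, from the figures above
seconds = 0.6          # upper bound on the unpacking time
clock_hz = 2.5e9       # one core of the 2.5 GHz host (single-core assumption)

emulated_per_second = instructions / seconds          # roughly 12.8 million
host_cycles_per_instruction = clock_hz / emulated_per_second  # a little under 200

print(f"{emulated_per_second / 1e6:.1f} million emulated instructions per second")
print(f"{host_cycles_per_instruction:.0f} host cycles per emulated instruction")
```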

Dynamic binary translation is far superior to my interpreter-based implementation, however.  I’ve been thinking about starting work on this for a while.

Oh, and I spoke to my local university about a research project for a Master’s degree on on-demand malware unpacking and structural classification.  They were quite excited about the topic, but whether I start this project really depends on whether or not I start work.  I will still try to go to uni if I work, but I might have to shop around at different universities.
