I had a look at how Bochs handles detection of the trap flag to indicate single stepping, and in general what happens is that before each instruction is executed, any exceptions held in a queue are processed.
Originally in my emulator’s main loop, I was checking for conditions such as hardware breakpoints and also if single stepping was enabled. If it was, I’d throw an exception once the instruction was complete.
I was able to remove the explicit check for the hardware breakpoints the other day in the main loop and I mention that in another post, but today I was also able to eliminate the single step check.
To enable a single step the eflags register needs to be changed to set the trap flag. This is done primarily by using the popf instruction which can set the trap flag. pushf is used to obtain the current eflags contents, and that is or’d with 0x100 (from memory) to set the specific flag in question. In theory, returning from a win32 exception handler could also set this flag (though I haven’t checked), and also SetThreadContext could set it.
Once the trap flag is set, at the end of the next instruction, an exception is raised, and the trap flag is cleared. Some malware and packers (like telock) uses these types of exceptions to make reversing more difficult.
What I do now to handle this situation, is to instrument popf so that if the trap flag is set, the following instruction is modified in the instruction cache so that the function that handles its implementation points to a special function that in addition to performing the original function implementing that instruction, also raises an exception once complete. It also restores the function pointer so as not to call the single step function again.
Its clear the instruction cache (see the threaded interpretation blog post), has become a really useful tool that enables such optimisations to occur.
And that kind of leaves me with not so many places I can further optimise, which brings me to an interesting paper by the FProt anti-virus team and how they implement unpacking in their engine http://www.datasecurity-event.com/uploads/hybrid_approach.pdf.
In essence, they have an emulator which can detect known code sequences in the process image, and when the process tries to execute such a sequence (fprot calls them known code chunks), it can emulate the sequence in one go with its own specific code without generically interpreting instruction by instruction.
This appears to be a great way to significantly improve the performance of such an emulator, and I’m really interested to try to implement such a thing. But having already implemented code to find crypto constants in the process image, its clear that a fast string search is necessary. My current naive algorithm for finding crypto constants is noticeably slow. So I will have to look at the standard algorithms used by anti-virus and nids.
The FProt paper doesn’t explicitly describe the algorithm it uses, but clearly after program loading, and whenever code is executed that was previously written to (indicating self modifying code), these are the times when searching for those known code chunks should occur. And to avoid the cost of a map lookup at each EIP to match the known code chunks we found, the instruction cache entry for the beginning of each known code chunk should point to a special function that implements the ‘mini-emulator’ (as FProt calls it) which implements the emulation for that code chunk. The has no cost associated with it, except for searching for those known code chunks. It would probably make sense to search for known malware signatures at the same time.. but I’m not sure that I want to write a complete AV solution yet..
This post was probably not that interesting for most people, but that FProt paper is an interesting read none the less.