The emulator is responsible for loading the target binary and the DLLs that binary requires. In my initial attempts at implementing this, I didn’t actually load the DLLs specified in the import table into memory. But a number of packers were accessing the DLLs’ address space, if only to check that no software breakpoints had been set on the imported functions. Loading the DLLs isn’t strictly necessary for that case, but malware could also parse the DLLs directly to obtain function addresses, in its own implementation of, say, GetProcAddress. Loading the DLLs into the address space solves both problems.
I also initially tried hooking functions based only on the addresses in the import table. Back when I wasn’t loading DLLs, I placed known marker values in the IAT, so that when one of those values appeared in the instruction pointer, I knew the corresponding import had been called. My implementation of GetProcAddress returned these markers as well.
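As a rough illustration of the marker idea, here is a minimal Python sketch. It is not the emulator’s code: the names MarkerTable, fill_iat, MARKER_BASE and MARKER_STRIDE are all made up for this example, and the marker range is simply assumed to be an address range that nothing real is mapped at.

```python
# Hypothetical marker scheme: each import gets a fake, recognisable address
# instead of a real address inside a loaded DLL.
MARKER_BASE = 0xFEED0000    # assumed to be unused by any real mapping
MARKER_STRIDE = 0x10        # spacing between markers (arbitrary)


class MarkerTable:
    def __init__(self):
        self._by_name = {}  # (dll, func) -> marker address
        self._by_addr = {}  # marker address -> (dll, func)

    def marker_for(self, dll, func):
        """Return the marker for (dll, func), allocating one on first use."""
        key = (dll.lower(), func)
        if key not in self._by_name:
            addr = MARKER_BASE + len(self._by_name) * MARKER_STRIDE
            self._by_name[key] = addr
            self._by_addr[addr] = key
        return self._by_name[key]

    def lookup(self, eip):
        """Return (dll, func) if eip is a marker, otherwise None."""
        return self._by_addr.get(eip)


def fill_iat(imports, markers):
    """Return the values to write into the IAT slots, one per import."""
    return [markers.marker_for(dll, func) for dll, func in imports]


markers = MarkerTable()
iat = fill_iat(
    [("kernel32.dll", "LoadLibraryA"), ("kernel32.dll", "GetProcAddress")],
    markers,
)
# An emulated GetProcAddress would hand out the same kind of marker.
virtual_alloc = markers.marker_for("KERNEL32.DLL", "VirtualAlloc")
```

When the emulated instruction pointer lands on a value for which `lookup` returns a name, the emulator knows which import was called and can run its own handler instead of real code.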
Now that the DLLs are loaded, I hook based on the values in each DLL’s export table. To implement this, I created a global symbol table in which every DLL registers its exports when it is loaded.
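A global symbol table of this kind could look something like the following Python sketch. The class and method names are hypothetical, and the exports are passed in as an already-parsed dictionary of name to RVA, which skips the actual PE parsing.

```python
class SymbolTable:
    """Hypothetical global table mapping absolute addresses to exports."""

    def __init__(self):
        self.by_addr = {}

    def register_dll(self, dll_name, base, exports):
        """Register every export of a freshly loaded DLL.

        exports: dict of function name -> RVA (assumed already parsed).
        """
        for func, rva in exports.items():
            self.by_addr[base + rva] = (dll_name, func)

    def hook_for(self, eip):
        """Return (dll, func) if eip is the start of a known export."""
        return self.by_addr.get(eip)


symbols = SymbolTable()
# Base address and RVAs below are invented for illustration.
symbols.register_dll(
    "kernel32.dll",
    0x7C800000,
    {"LoadLibraryA": 0x1D7B, "GetProcAddress": 0xAE30},
)
```

The cost of this approach is visible in `register_dll`: every export of every DLL is walked and inserted, whether or not anything ever calls it.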
This worked well enough, except it was slow. Parsing the export table of every DLL and building the symbol table added about a second to unpacking a UPX-packed hello world. The total time for unpacking went from about 0.7 seconds to about 2 seconds, but I had also added a number of other things that slowed it down and were necessary for correct emulation, so the increase attributable to export parsing seems to be about 1 second.
That brings me to yesterday, when I went about reimplementing the export handling. I no longer parse the entire export table by default; lookups are now performed only for specific exports, namely the functions in the target’s import table and the win32 functions that I emulate.
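To show the difference, here is a small Python sketch of resolving only the names that are needed. It relies on the fact that a PE export directory keeps its name table sorted, so a single name can be found by binary search. The ExportDirectory class is a toy model of the three parallel export arrays, and all names and RVAs are invented; it is not the emulator’s actual implementation.

```python
from bisect import bisect_left


class ExportDirectory:
    """Toy model of a PE export directory (the three parallel arrays)."""

    def __init__(self, names, name_ordinals, functions):
        self.names = names                  # export names, sorted ascending
        self.name_ordinals = name_ordinals  # names[i] -> index into functions
        self.functions = functions          # function RVAs


def lookup_export(exports, name):
    """Binary-search a single name; return its RVA or None."""
    i = bisect_left(exports.names, name)
    if i == len(exports.names) or exports.names[i] != name:
        return None
    return exports.functions[exports.name_ordinals[i]]


def resolve_needed(exports, needed):
    """Resolve only the names the target imports or the emulator hooks."""
    resolved = {}
    for name in needed:
        rva = lookup_export(exports, name)
        if rva is not None:
            resolved[name] = rva
    return resolved


kernel32 = ExportDirectory(
    names=["ExitProcess", "GetProcAddress", "LoadLibraryA", "VirtualAlloc"],
    name_ordinals=[2, 0, 3, 1],
    functions=[0x1100, 0x1400, 0x1000, 0x1300],
)
hooks = resolve_needed(
    kernel32, ["LoadLibraryA", "GetProcAddress", "NoSuchExport"]
)
```

With a handful of needed names per DLL, this touches a few table entries per lookup rather than every export the DLL has.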
Has the performance increased with the changes I’ve made? Yes. I can now unpack UPX in about 0.7 seconds (or 0.4 to 0.5 seconds when the file is cached). I can unpack rlpack against calc.exe in about 6 to 7 seconds. I added a number of other optimisations since last week, when it was taking 20 seconds. Before the export handling patches it was taking about 8 to 9 seconds to unpack.
I still have a couple more tweaks to do. I don’t think I am currently using the import table hints (a hint is a suggested index into the export name table), even though I’ve written the code for it. Adding hints to the win32 functions I emulate would be another good optimisation.
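For reference, a hint-assisted lookup can be sketched in Python as below. The idea is to try the hinted index first and fall back to a binary search when the hint is stale or out of range. The function name and the sample arrays are hypothetical, and the export data is modelled as plain lists rather than parsed from a real DLL.

```python
from bisect import bisect_left


def resolve_with_hint(names, name_ordinals, functions, name, hint):
    """Resolve an export by name, trying the import hint first.

    names is the sorted export name table, name_ordinals[i] is the index
    into functions for names[i], and functions holds the RVAs.
    """
    if 0 <= hint < len(names) and names[hint] == name:
        i = hint                        # hint was right: no search needed
    else:
        i = bisect_left(names, name)    # stale hint: fall back to a search
        if i == len(names) or names[i] != name:
            return None
    return functions[name_ordinals[i]]


# Invented sample data for illustration.
names = ["ExitProcess", "GetProcAddress", "LoadLibraryA", "VirtualAlloc"]
name_ordinals = [2, 0, 3, 1]
functions = [0x1100, 0x1400, 0x1000, 0x1300]

good = resolve_with_hint(names, name_ordinals, functions, "LoadLibraryA", 2)
stale = resolve_with_hint(names, name_ordinals, functions, "LoadLibraryA", 0)
missing = resolve_with_hint(names, name_ordinals, functions, "Nope", 99)
```

A correct hint costs one string comparison; a wrong one costs that comparison plus the usual search, so the result is the same either way.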
I tried unpacking pelock and noticed an interesting ‘bug’ in the emulator, or rather one that appears when the emulator is being debugged with the tracer running in parallel. pelock checks for the existence of software breakpoints on the original imports of the unpacked binary. In my case, it happened upon __initenv, which is not a function but data. This variable presumably gets assigned during DLL loading, but of course I don’t implement an analogue. So in my emulator the variable held the value from the file contents, not the value set by the startup code. Tracing this resulted in a cmp of $0xcc against a value in memory that differed from the one being emulated, so the status flags differed and my tracer/emulator aborted.
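The divergence can be reproduced in miniature with the Python sketch below. It models only three of the flags an 8-bit cmp sets, assumes the check is a byte-sized compare, and uses two invented byte values standing in for the file contents and the value set at runtime; the real values in the trace were not recorded here.

```python
INT3 = 0xCC  # opcode of the x86 software breakpoint instruction


def cmp8_flags(value, imm=INT3):
    """Flags after an 8-bit `cmp` of imm against value (value - imm)."""
    result = (value - imm) & 0xFF
    return {
        "ZF": result == 0,           # set when the byte is a breakpoint
        "CF": value < imm,           # unsigned borrow
        "SF": bool(result & 0x80),   # sign bit of the result
    }


# Hypothetical first byte of the variable in each view of memory.
emulated_byte = 0x00   # value taken from the file contents
traced_byte = 0xD0     # value after the real startup code has run

emulated_flags = cmp8_flags(emulated_byte)
traced_flags = cmp8_flags(traced_byte)
diverged = emulated_flags != traced_flags
```

Neither byte is 0xcc, so the packer’s check passes in both cases, yet the carry flag differs, which is enough for a tracer that compares full register state to report a mismatch.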
I’m not sure how to fix this. I might go the whole hog and implement the ability to run specific code for each DLL (which, when being debugged, can copy the real data from the traced program).