Monthly Archives: January 2009

Automatic malware classification question.. and possible answer.

I’ve been looking at Zynamics vxclass ( and had a look at one of their presentations  Essentially its about identifying malware based on its structural (control flow) properties.  It can thus identify malware with byte level changes, if their structural properties remain (as in the case of recompiling, or adding minor functionality etc).

Essentially they take the callgraph and the control flow graph of each function.  For each functions flowgraph, they develop a signature, and use this to show a mapping between that callgraph and another thats in a stored database.  The also show mappings between functions with the same signatures.

They talk about one possible attack about using a dispatcher that hides the callgraph structure.  Essentially all function calls go through the dispatcher which obfuscates the call.  The authors of vxclass argue that information still remains in the flowgraphs of each function and matches to these can identify the malware.  The dispatcher trick sounds a bit like “nanomites” which replaces branch instructions with an int3 and encoded pointers in the Armidillo packer (or so I’m told).  I’ve also read about thess types of tricks in obfuscating disassembly papers.  I think structural analaysis will fail when both the dispatcher and nanomites tricks are combined to obfuscate both the call graph and the control flow graphs of each function.  But this isn’t what I’m really posting about..

What I’m curious about, is that it seems clear that vxclass disassembles and builds control flow graphs of functions even if they are not determined as being directly reachable from the entry point (ignoring indirect calls with indeterminable targets).  IDA does this, but IIRC it also wants there to be at least one call instruction to each unreachable function.  Even if that call is not determined as reachable itself.

The problem with including such functions, is that after unpacking, there is still likely alot of functions left around from the packer.  These functions are not wanted in the structural analysis phase, since we are only interested in whats been packed, not the packer.

OK.  I thought of a possible solution (I somtimes think i tend to use my blog more like a workpad where I can think problems though).  If we keep track of functions that have been executed, and exclude them from the structural analysis – and also do the non directly reachable functions.   There is some slight discrepencies if the unpacking stage doesn’t determine perfectly the original entry point, but for a largish program, this can probably be ignored.   

The reason why I’m looking at vxclass and thinking of these problems, is that I’m considering how difficult it is to incorporate malware classification into my unpacker.  I wrote some code last year to generate control flow graphs, callgraphs, and so some simplifications of the graphs to eliminate unconditional jumps.. This could be re-used in to do the structural classification.  Anyway, I’m not certain I’m going to try to implement it yet..  But it would perhaps result in a usable semi-realtime (not on-access) AV emulator/classification engine..


XP anti-debugging trick in Virtob.F crashes in Vista

Virtob.F isn’t too new, but it was recently the first time I looked at it.  I ran it in my emulator and it quickly crashed.

A little inspection and I saw it was using the return address on the stack from the EntryPoint code.  At least in Vista, the entry point is called like a regular function from kernel32.dll.  The PEB Is passed as an argument.  If the function returns, then one of the Rtl ExitProcess functions is called to terminate (using %eax as an argument).  This points clearly to the fact that somewhere in kernel32.dll takes control of the function first, probably with the entry point of the program and the PEB as an argument.  I haven’t investigate too much furthur beyond disassembling the caller of the entry point.

In any case, Virtob.F took the return address ([esp] at program start], subtracted 15 from it, and checked the contents of that memory address against the value 8.  If it didn’t match, it would branch to a non existant address and cause an exception/crash.

Turns out on XP and Vista, the calling code of the entry point changed.  On XP, the value described above is indeed 8.  In Vista, it’s something else.

I also came into some other interesting anti-debugging tricks with some other packers.  An early version of telock (as identified by peid) was overwriting its own instruction with a rep stosb; a prefetch trick.  My emulator and tracer don’t support that yet, but I might implement that today.. A few packers were using anti-emulation loops; one piece of malware wanted to iterate through a loop that modified memory 53 million times.

Maybe this post wasn’t blog worthy afterall.. hopefully someone will find it mildly interesting 😉

VMWare discrepency – another (small) means of VMWare detection.

[update: changed title to get better google seaches. OK. I suck ;-)]

I installed Vista in a VMWare image so I could test and debug my emulator against real malware from  I haven’t been overy successful, but while unpacking one of the Netsky variants packed with telock which uses hardware breakpoints, I came across a bug in VMWare.

I haven’t spent too much time analyzing the bug, but it seems to be the following is occuring..

If you set an execution hardware breakpoint on an address and single step onto that instruction, it immediately should raise an exception before completing the instruction (Intel specs say the exception occurs before execution of the instruction).  What happens in VMWare, is that the exception isn’t raised at this point.  It is only raised when a subsequent single step operation occurs.

Quick hack implements stolen bytes emulation

It turns out I had actually fully unpacked pespin (or at least emulated all that could be), and the stolen functions I was emulating were from the original binary ;-|

But I implemented a quick hack before then to emulate stolen bytes.  I simply record %esp and %ebp after an indirect call has been made.  I assume that all calls to stolen functions are indirect.  I emulate up until the point the stolen function tries to jump back to the original win32 function.  At this point, I assume the nearest symbol is the original function.  I restore esp and ebp to what they were at the last indirect call, and then call the emulation code for that particular function.  The reason why I restore esp/ebp is that my emulated win32 functions use them for arguments and also to obtain the return address.  Also, they are the only registers which arent rewritten by win32 functions.

There is a simple way to make the analysis fail, but I won’t say for now.

pespin also steals the code at the entry point.

I forgot to mention, that in tracer mode I was also able to unpack yoda’s protector 1.02.  I can’t do 1.03 as it has a number of win32 functions I don’t emulate; such as window handling.

[edit: added a couple hours later]

I added some extra code to check that an address belongs to a stolen function.  I disassemble the original function (from disk) up until the first branch instruction.  If the address being jumped too is between the start of the function and the first branch instruction, the address is part of the stolen function.  This method is alot safer than taking simply the nearest symbol without any further checks.  Of course, if alot of code is stolen it wont work.. I’ll wait until I see some malware do that before tackling the issue (I haven’t emulated enough samples obviously)

Some ideas on how handle stolen bytes in an AV style emulator.

I have been working on emulating pespin.  Pespin is a public packer for thos who don’t know.  I came across ‘stolen bytes’ protection in pespin which seems to make it very hard on my emulator to handle.

Remember that my emulator works by emulating userland x86 instructions and the win32 api.  That is, it never actually executes real win32 code contained in dlls like kernel32.dll.  It maintains a list of addresses representing exported functions in each dll, and if the program counter changes to point to one of those addresses, the emulator implements that function.  Loading the actual dll’s into the memory image is kind of optional.  I do it to maintain better emulation, but for the most part the memory is only for show.  It never gets executed.

Most AV emulators work in that fashion from what I can gather.  Another approach is to use a whole system emulator like bochs or qemu, and let  the emulated operating system execute win32 functions themselves.  In this case, the emulator doesn’t have seperate code for each api function.  It’s more reliable than the other approach, but not really used in commercial AV.

So back to pespin.. Instead of calling a win32 function directly, it copies the first 10 or so bytes from the function onto the heap.   At the end of those 10 bytes on the heap, a jump is made back to the original function skipping the bytes that were copied.  Calls to that win32 function  are made to point to the function on the heap.  In some versions of stolen bytes, the 10 or so bytes of code from the original function are destroyed.  Also, in some versions, the copied code is obfuscated with junk instructions or instruction reordering.

What this achieves in pespin, is that you its hard to trace calls to win32.  Intercepting execution of the original win32 function doesn’t work, because the function being called is the copied version on the heap.

For an AV style emulator, this spells disaster.   My win32 emulation is entirely dependant on intercepting calls to specific addresses for each exported function.  The short story, is that the win32 function being stolen never gets emulated, and the analysis aborts.  I can detect easily enough that something is wrong, because from the jump back to the original win32 function plus 10 or so bytes, that part of memory I know belongs to a loaded dll and I have marked it as a special case of being non executable.

Even whole system emulator based unpackers like Pandora’s Bochs have problems with pe reconstruction of the unpacked binary when stolen bytes are being used.  Though the actual emulation itself is handled correctly.

How to solve the problem of stolen bytes in an AV style emulator?  The short answer is that I don’t have a good solution to this yet.. Not sure that I will be able to find one either. (edit: while writing this post I think I have come up with some reasonable solutions)

One thing that could be done, is to replace the first few bytes of each win32 api function with a byte sequence that the emulator can detect.  This could be a  software breakpoint (not an int3 as this is normally checked for), or a transfer instruction to memory which is designated as representing that function.  If using a software breakpoint, we need to a special way of marking so that we can associate that breakpoint with the specific function.  Maybe just breakpoint followed by the address of the original function.

If we implemented this theme, what happens is stealing those first 10 or so bytes of that win32 function, copies our marker.  And when the copied function is called, we can easily detect which function it is.  The problem with this approach, and why I’m reluctant to implement this, is that it seems fairly easy to fingerprint the emulator.  If for example the set of win32 functions being emulated all have the same basic first few bytes.

Maybe we could make this marker polymorphic.  Maybe an approach is to use a technique used in viruses known as surface tracing.  Disassemble the function until the first control transfer instruction is made, and replace it with a transfer of our own that can be used to identify the function.  It should be possible to track function arguments automatically with some work, even when esp changes.

That might work ok.  The biggest problem with implementing it would be integrating it with my program tracer.

Hmm.. another idea which might be better is to catch the jump back into the original win32 function.  I can do this reasonably well already.  I can narrow it down to a particular function without real worries (the nearest symbol for instance).

Now I imagine what happens when stolen bytes are implemented, is that a disassembler disassembles the original win32 function up until the first control transfer instruction, or maybe just up until the first 10 instructions.  It might all be a disaster on my part if it doesnt disassemble, and uses the reloc information instead.

So I know the original function thats stolen, I just need to get the arguments in working order, and I can do a static analysis to adjust esp to get the original arguments.  This analysis would be a very simplified (not following branches) version of IDA’s stack tracing.

OK.. I think I will go ahead and implement that.  The other idea I had was using dynamic taint analysis to track the stealing of each function, but I think the idea I just talked about will work better, as I didn’t have a great plan for the taint analysis.

A few more words on say if it copies the entire function and uses reloc information to relocate it.  I am really stuck here (but maybe the dynamic taint analysis is the way to go).  The long approach is to implement the windows system calls, so that in theory the emulator could execute windows api functions directly.

Anyway.. this post turned more into a rambling more than anything else.  I guess I will implement a stack tracer now..

Vista “shim” engine messing with imports

An interesting “bug” came up in my emulator today.  I was resolving a GetProcAddress to the wrong import (as compared to a live program).  I hadn’t noticed this before, because I had accidentally set my emulator debug mode to do a one time correction of any discrepencies, by copying the relevant memory contents from the live program at load time.

What I discovered, was that GetProcAddress was actually being resolved to point somewhere in SHIMENG.DLL; but for all purposes was equivalent to the original GetProcAddress.   I couldn’t find a name to match the address using the Microsoft debug symbols, but it did show the prototype as being the same as GetProcAddress.

Infact, SHIMENG.DLL is part of a compatability layer implemented in Vista.  It allows for import hooking, and these hooks can replace the function of any specific API.  This allows for the implementation of legacy API functions that would since be considered obsolete.   It also allows for legacy beahviour to be implemented.

Apparently, using the microsoft application compatability toolkit, shims can be created for specific executables.  The idea is that you load into the shim database information for those legacy applications.

The PE loader if it is to implement a shim must also hook GetProcAddress since the address it returns must point to the correctly hooked function also.  This is what I think is happening with GetProcAddress by default in Vista.  It points to the shim version.

There is not a whole lot of documentation on the shim engine used in Windows, but its certainly interesting to note.

Profiling with cygwin/mingw (some bugs, hangs, fixes, new features)

In the past few days I’ve been trying to optimise my emulator.  It’s been going quite well, though I’m running out of good optimisations to do now.

It all started going downhill when I tried using gprof in cygwin.  Turns out there is a bug in the current version which results in profile times showing zeros instead of real figures. talks about this bug and provides some fixes.  Basically you need the latest gprof.exe or a really old one.  I checked out the source from CVS and rebuilt it.

The next problem I came across was that not everything was being profiled.

I wrongly thought I had to recompile libstdc++ with profiling support when I came across the next bug.  In all versions of mingw and cygwin, a program that is compiled with profiling support, and that exits very quickly, may hang.  I noticed this when configure scripts were hanging after testing the C compiler with int main() { return 0; }.

The bug turned out to be the use of TerminateThread in libgmon.

The GNU profiler works by injecting code into the target program to call the profiling code.  This profiling code is initiated by monstartup, and creates a new thread which every 100 milliseconds or so looks at the main thread (the program being profiled) and notes done  in a list where the program counter is.  At program exit, this list is dumped to gmon.out.  Then gprof parses gmon.out matching those addresses to symbols in the binary.

To turn off profiling in mingw, such as when the program is exiting. profile_off is called which uses TerminateThread to kill the profiling thread.  The problem with this is that TerminateThread can kill the thread when its just acquired a lock.  If the main thread uses this lock, it can hang since the terminated thread never released it.  The profiling thread in minimimalistic but it still uses some win32 functions which shouldn’t be terminated mid stream.

I rewrote the profiling exit code to exit gracefully, by examining a flag indicating termination is requested.  This fixed the hang.

I will hopefully submit a patch to mingw when I have some time.

In any case, I still wasn’t able to get full profiling information from my program.  Infact, libstdc++ is linked in statically to mingw.  Debug symbols are important to keep, as gprof works by using symbols available in the binary.  I overwrite libstdc++.a with the debug version since it wasn’t being used automatically.  Take that advice with a grain of salt.  Don’t do it unless you think you need to.  I could be wrong.

Another problem with profiling, is that it doesnt profile library or system calls.  I wrote a patch to libgmon so that if during a sample it found the program counter not to be contained by the program (eg, a library call), then it would do a stack traversal to find the first caller that can is inside the image; and thus can be profiled.

This works and I get generally close to 100% total times in a flat profile (I assume any discrepancy is from rounding errors).  The only problem with my patch to libgmon, is that you don’t get much extra information.  For example, I know that the new operator is using up 15% of the time in my program, but I don’t know who calls it.

In this particular case, I simply might hack at the code to extract the information on a case by case basis, but I don’t have a general solution which can be neatly packaged in a patch.

If your interested.. After a few rounds of optimisation, I can unpack a minimalist program of 5k by upx in about 0.15 seconds.  This is on my 2.5gz quadcore.  I can unpack calc.exe which is around 112k, packed by rlpack which is of medium complexity, in about 2.15 seconds.  There are other packers which are more complex than rlpack which incur more processing costs but I haven’t run too many tests yet.  I think my emulator is fast enough as-is for use in an email scanning AV, but not fast enough for on-access.