I’ve been looking at Zynamics vxclass (http://www.vxclass.com) and had a look at one of their presentations http://www.sourceconference.com/2008/sessions/pdf/carrera-AutomatedStructuralMalwareClassification.pdf. Essentially its about identifying malware based on its structural (control flow) properties. It can thus identify malware with byte level changes, if their structural properties remain (as in the case of recompiling, or adding minor functionality etc).
Essentially they take the callgraph and the control flow graph of each function. For each functions flowgraph, they develop a signature, and use this to show a mapping between that callgraph and another thats in a stored database. The also show mappings between functions with the same signatures.
They talk about one possible attack about using a dispatcher that hides the callgraph structure. Essentially all function calls go through the dispatcher which obfuscates the call. The authors of vxclass argue that information still remains in the flowgraphs of each function and matches to these can identify the malware. The dispatcher trick sounds a bit like “nanomites” which replaces branch instructions with an int3 and encoded pointers in the Armidillo packer (or so I’m told). I’ve also read about thess types of tricks in obfuscating disassembly papers. I think structural analaysis will fail when both the dispatcher and nanomites tricks are combined to obfuscate both the call graph and the control flow graphs of each function. But this isn’t what I’m really posting about..
What I’m curious about, is that it seems clear that vxclass disassembles and builds control flow graphs of functions even if they are not determined as being directly reachable from the entry point (ignoring indirect calls with indeterminable targets). IDA does this, but IIRC it also wants there to be at least one call instruction to each unreachable function. Even if that call is not determined as reachable itself.
The problem with including such functions, is that after unpacking, there is still likely alot of functions left around from the packer. These functions are not wanted in the structural analysis phase, since we are only interested in whats been packed, not the packer.
OK. I thought of a possible solution (I somtimes think i tend to use my blog more like a workpad where I can think problems though). If we keep track of functions that have been executed, and exclude them from the structural analysis – and also do the non directly reachable functions. There is some slight discrepencies if the unpacking stage doesn’t determine perfectly the original entry point, but for a largish program, this can probably be ignored.
The reason why I’m looking at vxclass and thinking of these problems, is that I’m considering how difficult it is to incorporate malware classification into my unpacker. I wrote some code last year to generate control flow graphs, callgraphs, and so some simplifications of the graphs to eliminate unconditional jumps.. This could be re-used in to do the structural classification. Anyway, I’m not certain I’m going to try to implement it yet.. But it would perhaps result in a usable semi-realtime (not on-access) AV emulator/classification engine..