Automatic malware classification question.. and possible answer.

I’ve been looking at Zynamics vxclass (http://www.vxclass.com) and had a look at one of their presentations http://www.sourceconference.com/2008/sessions/pdf/carrera-AutomatedStructuralMalwareClassification.pdf.  Essentially its about identifying malware based on its structural (control flow) properties.  It can thus identify malware with byte level changes, if their structural properties remain (as in the case of recompiling, or adding minor functionality etc).

Essentially they take the callgraph and the control flow graph of each function.  For each functions flowgraph, they develop a signature, and use this to show a mapping between that callgraph and another thats in a stored database.  The also show mappings between functions with the same signatures.

They talk about one possible attack about using a dispatcher that hides the callgraph structure.  Essentially all function calls go through the dispatcher which obfuscates the call.  The authors of vxclass argue that information still remains in the flowgraphs of each function and matches to these can identify the malware.  The dispatcher trick sounds a bit like “nanomites” which replaces branch instructions with an int3 and encoded pointers in the Armidillo packer (or so I’m told).  I’ve also read about thess types of tricks in obfuscating disassembly papers.  I think structural analaysis will fail when both the dispatcher and nanomites tricks are combined to obfuscate both the call graph and the control flow graphs of each function.  But this isn’t what I’m really posting about..

What I’m curious about, is that it seems clear that vxclass disassembles and builds control flow graphs of functions even if they are not determined as being directly reachable from the entry point (ignoring indirect calls with indeterminable targets).  IDA does this, but IIRC it also wants there to be at least one call instruction to each unreachable function.  Even if that call is not determined as reachable itself.

The problem with including such functions, is that after unpacking, there is still likely alot of functions left around from the packer.  These functions are not wanted in the structural analysis phase, since we are only interested in whats been packed, not the packer.

OK.  I thought of a possible solution (I somtimes think i tend to use my blog more like a workpad where I can think problems though).  If we keep track of functions that have been executed, and exclude them from the structural analysis – and also do the non directly reachable functions.   There is some slight discrepencies if the unpacking stage doesn’t determine perfectly the original entry point, but for a largish program, this can probably be ignored.   

The reason why I’m looking at vxclass and thinking of these problems, is that I’m considering how difficult it is to incorporate malware classification into my unpacker.  I wrote some code last year to generate control flow graphs, callgraphs, and so some simplifications of the graphs to eliminate unconditional jumps.. This could be re-used in to do the structural classification.  Anyway, I’m not certain I’m going to try to implement it yet..  But it would perhaps result in a usable semi-realtime (not on-access) AV emulator/classification engine..

2 responses to “Automatic malware classification question.. and possible answer.

  1. classification is about features extraction imo, once you got that there are many well documented classification techniques. In the paper they would let you think their product is robust because flowchart or callchart is not that easy to mess up.. i’m not sure about that, messing up is always simple. I’ve not a copy of vxclass to play with but i’m wondering if their method is susceptible to spoofing: if they rely on the structure of the graph it’s not too hard to forge one ad-hoc to spoof the signature of some other malware. hope to read more about the classification engine🙂

  2. Regarding classifying functions from within the packer, you’re thinking too hard: simply process functions from the section in which OEP lies.

    Regarding breaking VxClass, it’s trivial: compile-time obfuscation that destroys the flowgraphs. gcc -S and feed it to a perl script.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s