I thought I would write a small summary on an interesting paper that appeared at ACM CCS. http://www.ecsl.cs.sunysb.edu/tr/TR246.pdf is a link to the paper ‘Large-Scale Malware Indexing Using Function-Call Graphs’ which originates from Symantec.

Essentially the system identifies variants of malware similar to http://www.vxclass.com. It looks at the call graphs of particular programs and identifies the similarity to known call graphs of malware. If the similarity is high, then a malware variant has been identified. Identifying call graph similarity was introduced in http://www.f-secure.com/weblog/archives/carrera_erdelyi_VB2004.pdf. Call graph similarity is found by identifying common nodes in the graph. These nodes can be identified as having identical control flow graphs, or having the same crc32 over the function represented by the node, and any other ways you can think of. The Symantec paper identifies that similarity is a function of the graph edit distance. The edit distance is the number of operations that must be performed to convert one thing to another. Incidentally, there are related edit distances such as tree edit distances, string edit distances and lots of variations depending on what type of operations are allowed. In fact, my paper to be published in January uses one of the edit distances I just mentioned. It’s a useful concept.

Vxclass builds graph similarity by the use of a greedy node matching algorithm. It actually tries to find unique nodes first so the greedy heuristic isn’t a problem. But presumably it falls back to the greedy solution once unique solutions are exhausted. The Symantec paper tries to find a minimum cost matching using a bipartite graph matching algorithm based on the Hungarian algorithm. This is novel.

The other novel feature with the Symantec paper is the use of metric trees to speed up searching the call graph database. A metric space (http://en.wikipedia.org/wiki/Metric_space) is defined by a distance function between two particular points or objects . The distance function must have certain conditions that are true, such as d(x,y) >= 0, d(x,y) == d(y,x) and d(x,x) == 0. Also the triangle inequality ( http://en.wikipedia.org/wiki/Triangle_inequality) must hold which is *d*(*x*, *z*) ≤ *d*(*x*, *y*) + *d*(*y*, *z*). If these conditions are true for all objects or points, then that’s great, because performing spatial searches on the data can take advantage of these properties and perform much faster. A metric tree takes advantage of the properties of a metric space and can perform spatial queries identifying similar or nearest neighbours of a point in that space faster than comparing each point or element in the space.

Using the earlier described notion of graph matching to show distance is approximately the same as the graph edit distance. The graph edit distance forms a metric space. The Symantec paper uses vantage point trees (http://en.wikipedia.org/wiki/VP-tree) but there are other trees such as BK Trees which perform well also. I seem to recall reading that vp trees are most suited for when the distance function is a real number. BK Trees only work on integer distances. I don’t know why Symantec chose vp trees over bk trees – maybe someone else can answer? Perhaps there is no significant difference.

The novelty of the Symantec paper is using metric trees to speed up the similarity search.

The final novel contribution in the Symantec paper is an ordering each malware by specific features such as the number of instructions the malware has. The database is arranged in a b+tree and vp tree structures are kept at the nodes. Then when indexing a malware, they can cull out all malware that is not reasonably close to the features of the input binary, before searching each vp tree in the b+tree buckets. This is a pretty simple optimisation which I think can be improved upon than what was demonstrated in the paper. But Symantec still did a good job in what they did.

This is a nice paper overall that improves malware indexing state of the art. It’s not a revolutionary paper really, but each area has contributions that improve what has previously been investigated. The time it takes to classify a sample is interesting – about 100 seconds for a database size of 100k malware. They want to have it scale up to a million malware.