Antivirus IPC and Middleware

In this post, I’ll discuss some of the design issues, approaches and solutions I’ve encountered and taken designing and implementing my prototype Antivirus scanner. Specifically, this post will look at some of the issues involved with inter process communication, client-server Antivirus and middleware such as web services.

In the beginning, my Antivirus scanner was one single standalone application. It took an argument in the command line to say which binary should be scanned. This is the simplest approach, and not the most efficient. I discovered during evaluation that when the signature database was of a moderate size, loading of the database into main memory was taking a significant amount of time. In addition, simply loading the application into memory was taking longer than expected due to the large size of the binary and the number of libraries it linked against. In some windows systems caching allowed program loading to perform faster, but the database issue was a showstopper if I was going to scan any more than a trivial number of binaries.

The solution I chose was to make the Antivirus scanner a client-server architecture. I would start the scanner as a server, and take as long as necessary to load the database. The client would be a very compact and simple application that submitted jobs to the server and received a response. At this point the issue of IPC and middleware raised its head. How do the client and server communicate?

Possibly communication could be done using local IPC. I chose sockets. This enabled me to create separate clients for Linux and Windows, and submit jobs across the network. This was useful at the time because I wanted my malware collection to be stored on a fairly isolated Linux box to reduce the chance of unwanted malware infections.

Sockets are great, but how to implement the client-server model? One option is to use a custom protocol. Another option is to use middleware solutions such as RPC or even things like SOAP or Corba. I chose to go with a custom protocol.

How does the server display the results of each scan you might ask. Originally, I used a very ad hoc text-based output. Very hacky, very simple. I had a set of shell scripts to generate statistics on scan results using this output. The problem with this naive approach is that every time I slightly changed the result data, the scripts would break. My solution to this was to modify the output to use XML. I also rewrote those particular scripts to use Python which has reasonable XML processing capabilities.

Now I’m getting to the point where I want to expose the Antivirus system to the internet. How do I go about this? Web access seems important. My first approach at exposing the system was by using a PHP written web interface that connected to the server using the custom protocol I talked about earlier. The PHP also managed some SQL databases. There is more to an Antivirus scanner than just the engine, and the PHP managed this. I wrote a fairly simple interface and design for this, and it works reasonably well.

I am currently ditching the PHP web interface and going with a Java desktop client. I wrote a Java desktop client for binary navigation and analysis that communicated with the Antivirus and analysis engine using my custom protocol. But because the protocol is custom, it’s hard to expose this publicly through firewalls. I considered a HTTP tunnel using the TRACE method to forward communication from the client through the gateway to my Antivirus server, but I don’t really trust my implementation of a middleware layer. These things are hard to get right, and vulnerabilities are typically common in these types of applications. Corba as a middleware layer is out of the question because it doesn’t handle firewalls very nicely. Likewise, the Windows middleware layer (WCF) is very centric to .net, and I don’t want vendor lock-in at this point since I don’t see Mono at this point being a complete solution for my needs. Web services seem the way to go. The choices are SOAP and REST based services. I have chosen SOAP as the initial implementation using Apache Axis2. I plan to reimplement the PHP web interface into a server-side component based architecture with a java client on the desktop. The server side components will communicate to the Antivirus scanner using the custom protocol previously talked about.

There is whole slew of other questions to address, such as how tightly coupled should unpacking, static analysis and all the other components should be. Should middleware be used to integrate those components? The short answer is yes, and I’m using Java and SOAP to make the system more component based, but I will not address these issues in this post. I think this post has reached its word limit.

New Focus to Blog

It’s been a long time since I last posted and actively maintained this blog.

The reason for not posting is primarily because I have been trying to publish my research while at University. Dual submissions to the blog cause problems with the novelty of research. It seems that it is a hard balancing act to manage.

Then where am I up to now.. I submitted my Masters thesis in May – which includes the automated unpacker I spent many blog posts discussing. Since that time I have now started a PhD at Deakin University, continuing the topic of malware detection and classification.

In trying to balance University and what I can post in a blog, I believe I may have found a way to combine these two worlds.

I have recently been developing User Interfaces to my program analysis and malware classification system. Although I will not be making the source code available, I hope to make a web site or client sofware available for the public. The time frame is not certain, so I doubt it will be public before the end of the year.

The development of the UI seems like an opportunity to keep a video dairy and blog account of how the code progresses. I can’t comment too much on the details of any novel algorithms until the research has been published, but there is much to talk about in a UI and program  analysis in general.

The video progression of the UI can be seen at http://www.youtube.com/silviocesare

This is also an opportunity for people to provide feedback in the direction of the UI.

The interface is still in a very early stage of development. I’ve been coding the Java binary analysis GUI for about a week and a half, and this is my first forray into any real Java programming.

I have spent probably a cumulative time of a month or more on the web interface to the malware classification system.

I hope people enjoy this new direction to the blog – because an active blog is better than in inactive blog!

Classification of Malware Using Structured Control Flow

I am making available my paper from the AusPDC conference http://sites.google.com/site/silviocesare/academicpublications. Any feedback or comments would be greatly appreciated.

Abstract for AusPDC

http://www.acsw2010.scitech.qut.edu.au/acsw2010/Program_schedules/Abstracts.pdf

Classification of Malware Using Structured Control Flow

Malware is a pervasive problem in distributed computer and network systems. Identification of malware variants provides great benefit in early detection. Control flow has been proposed as a characteristic that can be identified across variants, resulting in flowgraph based malware classification. Static analysis is widely used for the classification but can be ineffective if malware undergoes a code packing transformation to hide its real content. This paper proposes a novel algorithm for constructing a control flow graph signature using the decompilation technique of structuring. Similarity between structured graphs can be quickly determined using string edit distances. To reverse the code packing transformation, a fast application level emulator is proposed. To demonstrate the effectiveness of the automated unpacking and flowgraph based classification, we implement a complete system and evaluate it using synthetic and real malware. The evaluation shows our system is highly effective in terms of accuracy in revealing all the hidden code, execution time for unpacking, and accuracy in classification.

See you at IEEE AINA

I will be presenting my research on a real-time flowgraph based malware classification system at the IEEE Advanced Information Networking and Applications (AINA) conference in Perth, Australia, April 2010 – http://www.aina2010.curtin.edu.au/.

An interesting paper on flowgraph classification

I thought I would write a small summary on an interesting paper that appeared at ACM CCS. http://www.ecsl.cs.sunysb.edu/tr/TR246.pdf is a link to the paper ‘Large-Scale Malware Indexing Using Function-Call Graphs’ which originates from Symantec.

Essentially the system identifies variants of malware similar to http://www.vxclass.com. It looks at the call graphs of particular programs and identifies the similarity to known call graphs of malware. If the similarity is high, then a malware variant has been identified. Identifying call graph similarity was introduced in http://www.f-secure.com/weblog/archives/carrera_erdelyi_VB2004.pdf. Call graph similarity is found by identifying common nodes in the graph. These nodes can be identified as having identical control flow graphs, or having the same crc32 over the function represented by the node, and any other ways you can think of. The Symantec paper identifies that similarity is a function of the graph edit distance. The edit distance is the number of operations that must be performed to convert one thing to another. Incidentally, there are related edit distances such as tree edit distances, string edit distances and lots of variations depending on what type of operations are allowed.  In fact, my paper to be published in January uses one of the edit distances I just mentioned. It’s a useful concept.

Vxclass builds graph similarity by the use of a greedy node matching algorithm. It actually tries to find unique nodes first so the greedy heuristic isn’t a problem. But presumably it falls back to the greedy solution once unique solutions are exhausted. The Symantec paper tries to find a minimum cost matching using a bipartite graph matching algorithm based on the Hungarian algorithm. This is novel.

The other novel feature with the Symantec paper is the use of metric trees to speed up searching the call graph database. A metric space (http://en.wikipedia.org/wiki/Metric_space) is defined by a distance function between two particular points or objects . The distance function must have certain conditions that are true, such as d(x,y) >= 0, d(x,y) == d(y,x) and d(x,x) == 0. Also the triangle inequality ( http://en.wikipedia.org/wiki/Triangle_inequality) must hold which is d(x, z) ≤ d(x, y) + d(y, z). If these conditions are true for all objects or points, then that’s great, because performing spatial searches on the data can take advantage of these properties and perform much faster. A metric tree takes advantage of the properties of a metric space and can perform spatial queries identifying similar or nearest neighbours of a point in that space faster than comparing each point or element in the space.

Using the earlier described notion of graph matching to show distance is approximately the same as the graph edit distance. The graph edit distance forms a metric space.  The Symantec paper uses vantage point trees (http://en.wikipedia.org/wiki/VP-tree) but there are other trees such as BK Trees which perform well also. I seem to recall reading that vp trees are most suited for when the distance function is a real number. BK Trees only work on integer distances. I don’t know why Symantec chose vp trees over bk trees – maybe someone else can answer? Perhaps there is no significant difference.

The novelty of the Symantec paper is using metric trees to speed up the similarity search.

The final novel contribution in the Symantec paper is an ordering each malware by specific features such as the number of instructions the malware has. The database is arranged in a b+tree and vp tree structures are kept at the nodes. Then when indexing a malware, they can cull out all malware that is not reasonably close to the features of the input binary, before searching each vp tree in the b+tree buckets. This is a pretty simple optimisation which I think can be improved upon than what was demonstrated in the paper. But Symantec still did a good job in what they did.

This is a nice paper overall that improves malware indexing state of the art. It’s not a revolutionary paper really, but each area has contributions that improve what has previously been investigated. The time it takes to classify a sample is interesting – about 100 seconds for a database size of 100k malware. They want to have it scale up to a million malware.

See you at AusPDC

I received notification tonight that I have been accepted to present at the AusPDC conference. I’ll post an abstract when the conference website publishes it online.

Twitter

Follow me on Twitter http://www.twitter.com/silviocesare.

Updates

I haven’t blogged for a few months. I’ve been busy finishing a prototype malware classification system based on flowgraph similarity. That has resulted in submitting a paper to the 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010) http://www.cse.unsw.edu.au/~rajivr/auspdc2010/. The system I developed and discussed in that submission is not fast enough for realtime use in desktop and EMail gateway AntiVirus.  To remedy that, I’ve also been working on a simpler flowgraph based classification system.  It detects less malware variants but performs in near realtime.  I’ve finished a basic prototype and hope to write up my results and submit to an ACM conference by the end of September.  I will write up more details about both systems after publication, which will be at earliest in January 2010.

Removing semantic NOP’s from Malware

A common obfuscation technique used by malware is by randomly inserting a sequence of instructions that have no other effect on the functionality of the program.  This technique is additionally used in polymorphic and metamorphic malware.

There are a couple of interesting papers that try to address this by identifying some classes of “semantic nops” in the malware and removing them.  The basic model is to identify a particular sequence or set of instructions to examine, typically generated by bruteforce of all possible combinations in each basicblock, and then ask an oracle if its a semantic nop.

There are two papers which are probably of interest.  Malware Normalization http://www.cs.wisc.edu/wisa/papers/tr1539/tr1539.pdf, is an in interesting paper and has several techniques to canonicalize a binary so that it can be more effective when used scanning for in antivirus.  Another paper of interest is http://www.eecs.berkeley.edu/~sseshia/pubdir/oakland05.pdf which is a little more formal (and thus for me, harder to understand), on Semantics-Aware Malware Detection.  This paper performs static analysis on the malware and tries to match it to a template signature which describes particular behaviours of the malware such as its memory access.

The semantic nop oracle that can tell you if a sequence of code is a nop or not, is an SMT decision procedure (http://en.wikipedia.org/wiki/Satisfiability_Modulo_Theories).  SMT is an extension of SAT.  You give it a list of constraints, such as equalities and arithmetic, and then you can query the decision procedure to see if certain conditions can possibly be true or false.  Counter examples can be given in return to prove the results of the decision procedure.

To determine if a sequence of instructions is a semantic NOP, you ask the decision procedure if the set of registers that are modified by the instructions, result in the same values as what they were before hand.  If they are the same, then it’s clearly a NOP.  You also need to check that all memory written to is the same before and after as well.  Care should be taken with the status flags, and the papers mentioned don’t talk in depth about this. Perhaps checking for a lack of liveness in the flags or that the flags are the preserved at the end of a potential nop sequence. All these types of checks are only applied within basicblocks.  Detecting semantic nops as code that branches seems a much harder problem.

This appears to fine in most cases, but what those papers I mentioned earlier fail to describe is the case of exceptions being generated from memory access.  Consider the following code, which for all purposes appears to be a semantic NOP.

mov temp_mem, %eax; # potential exception
inc %eax
dec %eax
mov %eax,temp_mem

Note: status flag are not live at the end of the above code sequence making it eligable as a nop, which means there is a subsequent operation that overwrites the flags before using them.

That code indeed is a semantic NOP, _except_ when an exception occurs from accessing temp_mem.  If an exception occurs, then eax is clearly different from its starting condition when control is transferred to the exception handler.  This opens up a can of worms I think.

If we only look for semantic nops between memory accesses, we will undoubtedly miss a large number of cases that really are nops.  But if we ignore the potential for exceptions, then removing those misidentified nops will result in an unsound transformation.

Another problem was suggested to me, in terms of exceptions resulting from hardware break points. Removing a sequence of code, or applying a transformation to it, may not be sound.

PS.  I implemented a basic and incomplete version of dynamic taint analysis, and am using it to track the return value of GetProcAddress, so I can reconstruct the import table after auto unpacking.  It works fairly nicely.  And I implemented most of the code to transform my IR into STP constraints (STP is an opensource SMT decision procedure).