Instrumenting C sources

I was getting tired of reversing IOS yesterday, so I thought I’d look into into symbolic execution of C source code.  I am using the ExE paper as a reference http://hci.stanford.edu/cstr/reports/2006-01.pdf on how to go about doing it.  Actually, they (Dawson Engler and co) released an ExE paper a few years with basically the same material, except it seems in the recent paper they use CVCL instead of STP for the constraint solving.

It seems the approach they took is to instrument the target source in question witht the symbolic execution checks and execution forking.  This seems like a pretty good idea, so I thought the first thing I should try to do is do the source code instrumentation.

ExE uses CIL http://manju.cs.berkeley.edu/cil/ to do the instrumentation.  CIL is written in OCaml and goes about representing C as an Intermediate Representation (IR) that is fairly close to the original source.    The reason for keeping it close to the source is that one of the primary uses of CIR is source-to-source transformation.  CIL has a bunch of libraries to do typical compiler analysis, like control flow, data flow, it even has points-to analysis.

The problem for me with CIL,  is that it’s written in OCaml, which is a language I haven’t used before.  I’ve written some very simple programs in Lisp before, but that was many years ago.  I read a couple of quick tutorials last night on OCaml, but I need to read a bit more before hoping to jump into CIL.

Another project I looked at today was TXL http://www.txl.ca/.  It is another source transformation rule where you write a grammar for your source language and then use rewrite rules for the transformation.  It seemed like a reasonable approach and easier to do than learning OCaml in an evening, so I intalled it to have a play.  Mind you, I had to edit the install script and rename the directory it came in to get it to install, but it was OK.

I thought I’d jump straight into it.  Basically, what needs to be instrumented are expressions that use variables, but then I hit the reality that will be of how I was actually going to implement this.

First, how do we keep track of symbolic state of every variable.  We could wrap every variable in a struct, and have a member specifically to store state on weather its symbolic or concrete.  The problem with this, is that it will break alot of code that expects variables to be of a specific size.

Another solution could be keep a list of pointers to variables.  So to make a variable symbolic we could instrument with something like (eg for variable int x) make_symbolic(&x);  This is how I think ExE does it, although they are somewhat ambiguous with the actual details.  Lets say we keep a global map of variables that we are tracking as symbolic.  Each entry in our map contains a pointer to the variable in question.  To check if a variable is symbolic or not, we simply check if the address of that variable is in our map.  The problem with this approach is scoping of variables, and procedure parameters and return values.

A local variable’s address is only valid for the procedures activation record, that its in.  So we would need to keep an activation record for each procedure and keep track of local variables that we put in our symbolic map that we talked about earlier.  This seems hard to implement with TXL, since its not really a programming environment.  Which is annoying, because I dont really want to learn OCaml although it would be a handy skill to have.

The other options I have are to either write my own instrumentation code which is to write a C frontend, a transformation middle, and a backend that emits C code.   There is bison (the gnu successor to yacc), or even Antlr which supposedly is much cooler.  There are grammars available, although it seems quite difficult to get a grammar to parse GCC specific code for bison.  I’ve never written a parser for a significant language before, so not sure how I’d do if I decided to take this approach.

Or maybe modify sparse http://www.kernel.org/pub/software/devel/sparse/ or GCC (or even LLVM).  I had a quick look at sparse which is essntially a compiler frontend that parses (GCC) C.  It has a couple example backends, eg, a compiler, and of course the static analyser.  It didn’t compile out of the box in my cygwin environment.  The OS environment variable in my (default cygwin) environment was set to WindowsNT, and sparse was expecting it to be set to cygwin.

I’m kind of scared of compiler code to be honest, even though I shouldn’t be I guess.  I also downloaded the GCC and LLVM sources.  LLVM has a backend that generates C code.

I suspect I will have to end up learning OCaml if I’m going to get something implemented..

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s