Monthly Archives: August 2008

Linux Kernel Memcheck working (but no kernel bugs found in booting)

It’s been a busy few days. I was truly struggling for a while, making no progress, but it all seems to have resolved itself.

I have a working memcheck tool for the Linux kernel that checks for out-of-bounds memory access in both the page and slub allocators.  It honours the original size requests for kmalloc etc., so the tool may find bugs that would otherwise go unnoticed because allocations are often rounded up to conform to the slub’s internal workings.

I tried booting Fedora 8 with a modified 2.6.26 kernel with my tool running, and there were no reports of out-of-bounds access.  However, my code only activates once the memory subsystem is initialized, which leaves a reasonable amount of code uncovered.

For better results, a kernel stress test should be performed using public testing tools for kernel developers.  This is something I plan on doing next.

I am likely to release this tool as a QEMU fork and set of kernel patches at the Ruxcon security conference later this year.  I will be disappointed if I don’t find any real bugs with it, but I think there is still value in the code I’ve written as a validation and testing tool.

— how it works

I’ve explained this in earlier posts, but I’ll say it again since it’s really working now.  The operation of it is quite simple.

QEMU is modified such that it tracks execution of the Linux kernel running as a guest.  Heap operations that allocate and free memory are intercepted (really, they are dynamically instrumented).  In Linux, this means the page allocator (which uses a buddy system), the slub allocator, and the bootmem allocator.

A small set of kernel patches is used in the guest to enable easier intercepting of these functions.

A representation of the heap is constructed using the information returned from the intercepted heap functions.  This representation is held by the QEMU host.

All memory operations performed by the Linux kernel guest are intercepted/instrumented.  If a memory access falls within the range of the kernel heap, it is checked against the heap representation.  If the address is not part of a prior allocation, a report containing the kernel’s stack trace at that point is generated.
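A minimal sketch of such a heap representation in C, with hypothetical names (heap_note_alloc, heap_note_free, heap_check_access are mine, not the real QEMU-side code):

```c
#include <stdint.h>
#include <stdlib.h>

/* One record per live guest allocation, keyed by guest virtual address.
 * A linked list keeps the sketch short; the real thing wants a tree. */
struct alloc_record {
    uint32_t addr, size;            /* size is the original request */
    struct alloc_record *next;
};

static struct alloc_record *heap_map;

/* Called when an intercepted allocator (e.g. __kmalloc) returns. */
void heap_note_alloc(uint32_t addr, uint32_t size)
{
    struct alloc_record *r = malloc(sizeof *r);
    r->addr = addr;
    r->size = size;
    r->next = heap_map;
    heap_map = r;
}

/* Called when kfree etc. is intercepted; nonzero means freeing
 * memory that was never allocated, which gets reported. */
int heap_note_free(uint32_t addr)
{
    for (struct alloc_record **p = &heap_map; *p; p = &(*p)->next) {
        if ((*p)->addr == addr) {
            struct alloc_record *dead = *p;
            *p = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}

/* Called on every instrumented guest load/store into the heap range;
 * nonzero means out of bounds, so generate a stack trace report. */
int heap_check_access(uint32_t addr, uint32_t len)
{
    for (struct alloc_record *r = heap_map; r; r = r->next)
        if (addr >= r->addr && addr + len <= r->addr + r->size)
            return 0;
    return -1;
}
```

Because the recorded size is the original request rather than the rounded-up slab size, an access into the rounding slack is still flagged.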

Stack traces are symbolic, and the System.map of the guest kernel must be provided to the QEMU host.

Freeing non-allocated memory and use-after-free (really a form of out-of-bounds access) are also detected, with accompanying reports.

The implementation is fairly straightforward and has a small footprint: about 1000 lines for the QEMU changes, and fewer than 400 additional lines for the Linux kernel, mostly consisting of instrumenting and wrapping preexisting code.

QEMU Hacking (part 2)

I’ve written most of the code for MemCheck, enough to start testing and tweaking.  I instrumented the __ld and __st functions to check that memory accesses are within allocated buffers.  Initially I thought I would use the memory watchpoint code in QEMU to do the instrumentation, but this turned out to be fruitless: there is special handling/optimisation for the watchpoints, and to top it off, the helper functions use only physical addresses.

The problem I’ve encountered so far is the use of alloc_pages (__get_free_pages etc.) in the kernel.  kmalloc and kmem_cache_alloc use alloc_pages internally, but these symbols are exported publicly and general code is free to use them too.  Also, alloc_pages returns a struct page *, when I really want virtual addresses.  At first I tried writing a macro for the page -> address translation, but it turns out there is an actual function in my kernel build that does it.  I then tried intercepting the first call to page_address (the translation code) after alloc_pages to get the virtual address.  This almost worked.

I think the best solution I have is to hack on the kernel sources so that there is strict separation between public and private interfaces.

QEMU Hacking

1) QEMU translation blocks

I wrote the basic code for function interception in QEMU; it’s probably overengineered, with too many layers of indirection (i.e., it might be slow).  One thing that caught me out for a short while was an optimisation QEMU does with translation blocks.

QEMU translates native instructions into ‘micro operations’ and builds them up into ‘translation blocks’.  When execution occurs, one of the first things that happens is a lookup for a translation block that has already been created.  Translation blocks are small units of code, generally finishing at basic block boundaries.  QEMU has a nice optimisation where, if a call or jump is performed, the current translation block simply continues at the target address of the call/jmp.  This makes the translation blocks much bigger and results in faster execution (no setup time to call or return from a translation block, and it allows more optimisation to occur inside the translation block too).

Initially I was checking whether the program counter was at an intercepted function at the start of each translation block.  But this is incorrect: with the chaining I just described, a translation block might cover more than one native basic block.  I disabled this feature in QEMU for kernel code so my interception would work.  There are certainly better ways to perform interception than what I did; I will consider optimisation later, as for now I’m just happy to have it working.
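As a sketch of that approach, with made-up guest addresses: the check at each TB start only fires reliably once call/jmp chaining is disabled, so that an intercepted function entry always begins a fresh translation block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest addresses of functions to intercept, e.g. the
 * entry points of __kmalloc and kfree taken from System.map. */
static const uint32_t intercept_pcs[] = { 0xc015a000, 0xc015b200 };

/* Called with the guest PC at the start of each translation block.
 * This only works if TB chaining across call/jmp is disabled for
 * kernel code; otherwise a function entry can sit mid-TB and never
 * appear as a TB start address. */
int maybe_intercept(uint32_t pc)
{
    for (size_t i = 0; i < sizeof intercept_pcs / sizeof intercept_pcs[0]; i++)
        if (intercept_pcs[i] == pc)
            return 1;       /* run the memcheck handler here */
    return 0;
}
```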

2) Linux kernel calling convention

I then tried intercepting __kmalloc and kfree from the Linux kernel.  It wasn’t working too well, as the arguments I was intercepting seemed incorrect.  It wasn’t until I disassembled vmlinux that I realized the calling convention was using registers for argument passing.  I don’t know when this was introduced into Linux (I guess it’s been some years since I’ve done any kernel hacking).  I don’t even know the gcc option to do it!
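For what it’s worth, i386 Linux is built with gcc’s -mregparm=3, which passes the first three integer arguments in EAX, EDX and ECX.  A sketch of pulling intercepted arguments out of a simplified CPU state under that convention (the struct stands in for QEMU’s CPUX86State):

```c
#include <stdint.h>

/* Simplified i386 CPU state, standing in for QEMU's CPUX86State. */
struct cpu_state {
    uint32_t eax, ecx, edx, esp;
};

/* With gcc's -mregparm=3 (how i386 Linux is built), the first three
 * integer arguments arrive in EAX, EDX, ECX, in that order. */
uint32_t regparm_arg(const struct cpu_state *s, int n)
{
    switch (n) {
    case 0: return s->eax;   /* e.g. size for __kmalloc */
    case 1: return s->edx;   /* e.g. gfp flags */
    case 2: return s->ecx;
    default: return 0;       /* further arguments spill to the stack */
    }
}
```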

3)  Heap management in Linux

After I fixed the calling convention and intercepted __kmalloc and kfree correctly, I built my own representation of the heap.  It didn’t work: I was getting reports that kfree() was being called on pointers that hadn’t been allocated by __kmalloc.

First, there are a number of __kmalloc functions: __kmalloc_node_track_caller, __kmalloc_track_caller, and a couple of others depending on kernel configuration.  Intercepting all these calls helped a little, but not much.

After looking at the source code, there is actually a layer above __kmalloc, dependent on which heap allocator is being used: slab, slob, or slub.  kmalloc() is actually an inline function that can potentially call kmem_cache_alloc or __get_free_pages.  I tried intercepting kmem_cache_alloc and this resolved most of the problems, but I left out __get_free_pages, as I would have to contend with __free_pages also, which leads to alloc_pages as well.

That might seem like a good approach, intercepting everything, but it’s not.  The problem is that precision is lost.  kmalloc works with buffers of essentially any size; by the time __get_free_pages or kmem_cache_alloc is called, the buffers have all been rounded up to cache sizes or page sizes.  If I intercepted kmem_cache_alloc and friends, that lost precision could lead to me missing buffer overflows.  They wouldn’t be exploitable overflows in all likelihood, but they would be overflows nonetheless, and these are really the ones my tool is best placed to detect.
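To illustrate the lost precision: slub satisfies a kmalloc from a fixed-size cache, roughly the next power of two (this sketch ignores the 96- and 192-byte caches real kernels also have).  An overflow that stays inside the rounding slack is invisible if only the rounded size is tracked:

```c
#include <stdint.h>

/* Roughly how slub picks a kmalloc cache: round the request up to the
 * next power of two, with a minimum of 8 bytes.  Simplified: real
 * kernels also have 96- and 192-byte caches. */
uint32_t kmalloc_rounded(uint32_t size)
{
    uint32_t c = 8;
    while (c < size)
        c <<= 1;
    return c;
}
```

So kmalloc(100) comes from the 128-byte cache: a write at offsets 100..127 overflows the request but stays inside the cache object, and a checker tracking only the rounded size never sees it.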

My solution is to make kmalloc() non-inline and export it to the modules.  kmalloc_node is also exported and made non-inline.  I’m building a new kernel as I type, but it’s probably not going to be finished until tomorrow, so no more coding for the day..

4) System.map doesn’t match /proc/kallsyms

This was somewhat annoying.  In my default kernel build, and also in Fedora 8, the System.map doesn’t match /proc/kallsyms.  The bug is talked about elsewhere.

I changed the kernel config with the fix described and am hoping that it will be resolved.

Port of (i386) QEMU to C++

It’s been a rather long week, as I’ve been building QEMU in my cygwin/mingw environment.  My build environment is cygwin, but I use -mno-cygwin to use the mingw compiler.

I don’t want to delve too much into building QEMU, since I really should include the patches I made to get everything working.  I haven’t included those patches as part of this post, so this is more just a status update on what I’m up to.

QEMU has dependencies on several other packages, including GNU TLS, which doesn’t build cleanly in mingw/cygwin.  I did manage to get it to build, however, which enabled me to build QEMU.  QEMU 0.9.1 doesn’t build cleanly either, but only for one of the non-i386 architectures.  I wrote a patch which I’ll link to in another post.

0.9.1 is also buggy in Windows.  I downloaded QEMU Manager, which is a GUI frontend.  The latest version uses either 0.9.0 or 0.9.1 depending on configuration; 0.9.1 crashes almost immediately, while 0.9.0 is OK.  The current SVN sources are also bug free, although they required some patches to build in mingw/cygwin.  I might post the patches I wrote, but there is probably not much demand for this.  All I did was disable one particular new feature introduced since 0.9.1.

I thought I’d go about tracking down the bug in 0.9.1 since I wanted to do my QEMU changes to a reasonably recent version (0.9.1 is the current stable version).  I knew 0.9.0 was working, and 0.9.1 was buggy, so somewhere in between the bug was introduced.

I took the SVN revisions of the 0.9.0 and 0.9.1 tags and did a binary search: I would take the middle revision, build it (making appropriate patches when necessary to compile), then check if it crashed.  I eventually tracked down the revision that introduced the crash.  I then looked over the diff but couldn’t immediately see the bug, so I started applying the patch to individual files only (actually I did it backwards, by reverting files to their original state).  That isolated the crash to the patch against a single file.

I then decided to compare the code causing the crash against the current SVN, which has the bug fixed.  I found the relevant lines of code, along with a checkin message talking about fixing memory corruption under Windows.  This was surely the problem, so I patched the code by hand and voila, no more crashing.

I’m not sure whether it’s the best idea not to use the current SVN as my base for modifying QEMU.  For now I’ll stick with the most recent stable release..

Another setback is that I was unable to get the KQEMU accelerator to work.  When it was installed, QEMU would crash while booting an image.  I think this may be a problem with Vista, as KQEMU is said to be untested on this host.  The funny thing is, in QEMU Manager, KQEMU seems to work (is it really running?).

I plan to write a memcheck tool in the spirit of Valgrind for QEMU running the Linux kernel.  I prefer hacking in C++ and having access to the STL.  I’m not sure this will be to everybody’s liking; C++ seems to be loved or hated.  So I went about porting QEMU to compile with g++.

Mostly what was needed were explicit casts from void *.  Several parts of the code used C++ reserved words like private or class.  Painful to port was C99 static initialization using designators, e.g. struct foo f = { .member = 1 }; g++ doesn’t implement C99-style initialization.  I removed the designators from some of the code, expanding the initialization to include all the initial members of the structs.  I found out later that there is a GNU extension which does the equivalent: { member: 1 }.  It turns out that, at least in my version of g++, it doesn’t work all the time; it complains that too many members are being initialized in some cases.  Maybe this is fixed in a more recent version than the cygwin-supplied mingw gcc.

Most painful was when array indexes were being initialized, e.g. char foo[] = { [10] = ‘a’ };.  I expanded some of the arrays, like the SSE instruction handling, but it’s very easy to introduce a bug with an extra or missing line.  For some parts of the code, I simply turned the static initialization into a runtime affair by writing an initialization function with the gcc constructor attribute.  Then I used some search/replace expressions to convert the static initialization into runtime initialization.
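The runtime-initialization rewrite looks roughly like this (my own toy table, not QEMU’s):

```c
#include <string.h>

/* C99 designated form that g++ rejects:
 *   static char type_tab[16] = { [10] = 'a', [12] = 'b' };
 * C++-friendly replacement: fill the table at startup from a function
 * marked with gcc's constructor attribute, which runs before main. */
static char type_tab[16];

__attribute__((constructor))
static void init_type_tab(void)
{
    memset(type_tab, 0, sizeof type_tab);
    type_tab[10] = 'a';
    type_tab[12] = 'b';
}
```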

I also had a lot of pain with QEMU’s linking.  Firstly, the use of INLINE for inline functions is pretty bad: functions are defined as inline inside one source file, then used in another source file with the only glue linking them together being the linker, i.e., not inline at all.  In C++, inline functions need to be defined so that all code using them sees the definition.

The biggest drama with linking was QEMU creating an object file containing the operations performed by the JIT.  Another program parses the object file, extracting the code for use in the JIT aspects of QEMU.  I had to compile the object file as extern “C” to avoid C++ name mangling.  It turns out the definitions and prototypes for this are all over the shop.

Well, to summarize: I have a working C++ port of QEMU for i386 (but none of the other architectures).  I tested it by installing Fedora Linux and then building a new kernel image (with frame pointers enabled, so I can get a good stack trace for my memcheck code).  The kernel compile took over 10 hours (ouch).

I’m not sure if anyone besides myself is interested in a C++ version of QEMU.  I have a C++ version of dynamips also if anyone is interested.

Valgrind memcheck for Linux Kernel

I have started a number of projects and most are incomplete, so why not start another?  But really, there is a natural progression to the projects I’ve been working on.

I started a static analysis tool (was working, but the compile is now broken due to mixing stdio and iostream, and to my modifying part of the sources in another project), then wrote an x86 emulator (working, but not complete), which led me into a more ambitious emulator project in symbolic execution of IOS by modifying dynamips (project not working), which I then simplified into a side project implementing a Valgrind-like memcheck tool (not really working; too many false positives, caused by an IOS API that modifies the heap and which I still have to reverse).

So the next diversion could be a Valgrind-style memcheck for the Linux kernel.  This will really be no different from the code I implemented in dynamips, except of course I can’t use dynamips anymore; I’ll probably use QEMU, with Bochs as an alternative (but I’ll use QEMU).  The modification I’d make is really quite simple: intercept malloc/free (or whatever the Linux kernel equivalent is), maintain my own view of the heap in QEMU, and then intercept memory reads/writes to the heap and make sure they fall within an allocated buffer.

In the past day or two, before I thought of this latest idea, and to spur some new activity (as I haven’t been making progress on my other projects), I started looking at PowerPC assembly.  Dynamips uses PPC for the Cisco 1700 emulation, and the images for this machine are a lot smaller than the c7200’s (but still pretty big).

But back to the Valgrind project.  I’ve looked at the QEMU sources before, but didn’t try any modifications.  The biggest hurdle for the Valgrind-like code I plan to write is interfacing with QEMU; the actual code for maintaining a view of the heap and checking reads/writes is fairly trivial.  Fortunately, even though x86 memory writes can occur in many types of instruction, QEMU generates micro operations for x86 which single out the memory accesses.

I also finished a University assignment today, so might try to play with Qemu over the next couple of days..

Instrumenting C sources

I was getting tired of reversing IOS yesterday, so I thought I’d look into symbolic execution of C source code.  I am using the EXE paper as a reference on how to go about doing it.  Actually, they (Dawson Engler and co.) released a paper a few years earlier with basically the same material, except that in the recent paper they use STP instead of CVCL for the constraint solving.

It seems the approach they took is to instrument the target source with the symbolic execution checks and execution forking.  This seems like a pretty good idea, so I thought the first thing I should try is the source code instrumentation.

EXE uses CIL to do the instrumentation.  CIL is written in OCaml and represents C as an Intermediate Representation (IR) that stays fairly close to the original source.  The reason for keeping it close to the source is that one of the primary uses of CIL is source-to-source transformation.  CIL has a bunch of libraries for typical compiler analyses, like control flow and data flow; it even has points-to analysis.

The problem for me with CIL is that it’s written in OCaml, a language I haven’t used before.  I’ve written some very simple programs in Lisp, but that was many years ago.  I read a couple of quick OCaml tutorials last night, but I need to read a bit more before I can hope to jump into CIL.

Another project I looked at today was TXL.  It is another source transformation tool where you write a grammar for your source language and then rewrite rules for the transformation.  It seemed like a reasonable approach and easier than learning OCaml in an evening, so I installed it to have a play.  Mind you, I had to edit the install script and rename the directory it came in to get it to install, but otherwise it was OK.

I thought I’d jump straight into it.  Basically, what needs to be instrumented are expressions that use variables, but then I hit the reality of how I was actually going to implement this.

First, how do we keep track of the symbolic state of every variable?  We could wrap every variable in a struct, with a member specifically to record whether it’s symbolic or concrete.  The problem is that this will break a lot of code that expects variables to be of a specific size.

Another solution could be to keep a list of pointers to variables.  To make a variable symbolic, we could instrument with something like make_symbolic(&x) (e.g. for variable int x).  This is how I think EXE does it, although they are somewhat ambiguous about the actual details.  Let’s say we keep a global map of the variables we are tracking as symbolic, where each entry contains a pointer to the variable in question.  To check whether a variable is symbolic, we simply check whether its address is in the map.  The problem with this approach is the scoping of variables, and procedure parameters and return values.
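A sketch of that address-map idea, with hypothetical names (make_symbolic, is_symbolic; EXE’s real runtime is far more involved).  Note it has exactly the scoping problem described: nothing removes an entry when its stack frame dies.

```c
#include <stddef.h>
#include <stdlib.h>

/* Track which variables are symbolic by their address, as instrumented
 * calls like make_symbolic(&x) would record them. */
struct sym_entry {
    void *addr;
    size_t len;
    struct sym_entry *next;
};

static struct sym_entry *sym_map;

void make_symbolic(void *addr, size_t len)
{
    struct sym_entry *e = malloc(sizeof *e);
    e->addr = addr;
    e->len = len;
    e->next = sym_map;
    sym_map = e;
}

/* Is this address inside any tracked symbolic variable? */
int is_symbolic(void *addr)
{
    for (struct sym_entry *e = sym_map; e; e = e->next) {
        char *base = e->addr;
        if ((char *)addr >= base && (char *)addr < base + e->len)
            return 1;
    }
    return 0;
}
```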

A local variable’s address is only valid for the procedure activation record it’s in, so we would need to keep an activation record for each procedure and track the local variables we put in the symbolic map discussed earlier.  This seems hard to implement with TXL, since it’s not really a programming environment.  Which is annoying, because I don’t really want to learn OCaml, although it would be a handy skill to have.

The other option I have is to write my own instrumentation code, which means writing a C frontend, a transformation middle-end, and a backend that emits C code.  There is bison (the GNU successor to yacc), or even ANTLR, which is supposedly much cooler.  Grammars are available, although it seems quite difficult to get a bison grammar to parse GCC-specific code.  I’ve never written a parser for a significant language before, so I’m not sure how I’d do if I took this approach.

Or maybe I could modify sparse or GCC (or even LLVM).  I had a quick look at sparse, which is essentially a compiler frontend that parses (GCC) C.  It has a couple of example backends, e.g. a compiler, and of course the static analyser.  It didn’t compile out of the box in my cygwin environment: the OS environment variable in my (default cygwin) environment was set to WindowsNT, and sparse was expecting it to be set to cygwin.

I’m kind of scared of compiler code to be honest, even though I shouldn’t be I guess.  I also downloaded the GCC and LLVM sources.  LLVM has a backend that generates C code.

I suspect I will have to end up learning OCaml if I’m going to get something implemented..

Processes in IOS

I had this idea that maybe the reason I was getting a lot of false positives from my memcheck tool (which claimed memory writes were happening outside of allocated blocks, where the writes weren’t part of malloc, free or IncMallocRefCnt) was that a process like the ‘Chunk Manager’ was working in the background making those modifications.

CreateThread, which Lynn documented in his Cisco exploitation work, creates a new process (really a thread) and returns the process ID (Lynn documents it as returning void *).  FX documents in his Phrack article that there exists a ‘Process Array’ containing an array of pointers to Process Control Blocks.  You can see the allocation by using ‘show mem’ in IOS, or by intercepting the malloc calls as I do in dynamips.

The Process Array is a block of 1032 bytes.  The first 4 bytes appear to be 0, and I suspect this field might actually be a pointer to the next block, used if the current block becomes full of processes.  The next 4 bytes are the number of active processes, or maybe the highest allocated pid.  The remaining 1024 bytes are an array of 32-bit words pointing to Process Control Blocks (PCBs).  pid 1 starts at the beginning of this array (i.e., 8 bytes in from the start of the block).  A 0 for the PCB pointer seemingly indicates a slot not in use.  256 processes seems very small to me, which brings me back to that first word being a pointer to the next Process Array, but I would need to confirm that.
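As a struct, my guess at the layout looks like this (field names are mine, and the first field being a next pointer is unconfirmed):

```c
#include <stdint.h>

/* Guessed layout of the 1032-byte IOS Process Array described above. */
struct process_array {
    uint32_t next;       /* 0 in my dumps; possibly a pointer to an
                            overflow Process Array when this one fills */
    uint32_t nprocs;     /* active process count, or highest pid */
    uint32_t pcb[256];   /* pointers to Process Control Blocks; pid 1
                            is pcb[0]; 0 means the slot is unused */
};
```

The sizes add up: 4 + 4 + 256 * 4 = 1032 bytes, matching the allocation seen in ‘show mem’.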

A PCB contains a few things.  As far as I can tell, it contains a pointer to the stack base; the next word contains, according to FX, the stack pointer; the next word I think was the entry point of the thread (from memory; I will have to re-run my code to verify this, but I have commented out the memory dumps for now).  The next fields, I think but am not 100% sure, contain saved registers.  At least the return address seems to match what I believe is a context switch function I found in IDA; or rather, the return address points to the instruction after a yield/context switch.

Looking at this context switch function (whose handling of the stack pointer I don’t fully follow), it uses a global variable which I think is a pointer to the current PCB.  Looking at memory writes to this value, it certainly appears to use the PCBs that CreateThread creates.

There also appears to be another PCB in use (assuming that global currentPCB variable is correct), which I think represents the init process, or pid 0.  I’m not too sure about that, but it makes the most sense.  It seems that a process can yield to the init process, which can then perform work; at least, when I check what the current process is when one of the memcheck false positives occurs (as I talked about in the first paragraph), 9 times out of 10 it’s coming from the same unknown PCB.

After booting, I’m getting about 9149 false positives from what I think is the init process, and around 855 false positives coming from around 13 other processes.  These other processes seem random, one example being RADIUS INITCONFIG.  There must be one or more functions called by these processes that modify the heap structures, but it doesn’t seem to be all at the hands of the Chunk Manager process..

Back to the drawing board..  Having to do so much reversing of IOS is really slowing me down.  In the first couple of days of working on the symbolic execution, I wrote nearly 3000 lines of code to do constraint generation from parsing MIPS asm.  But in the past week, I’ve barely written 300 lines, and it’s all wild stabbing in the dark.

A (very simple) way of finding malloc() in IOS

I suppose this is the simplest way of finding malloc() as IOS tells you exactly where it is!

Just do a ‘show mem’, and IOS gives you the ‘Alloc PC’, which is the address of the call site where malloc is called.  There are quite a number (nearly 20) of these mallocs, all of them calling a common function _malloc (or whatever you want to call it) with slightly different arguments.  The malloc wrappers also provide the Alloc PC address as an argument (derived from the return address) to _malloc.
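A sketch of what the wrappers appear to do, using gcc’s __builtin_return_address to stand in for however IOS actually derives the call site (the names and the _malloc signature are guesses):

```c
#include <stddef.h>
#include <stdlib.h>

static void *last_alloc_pc;  /* what 'show mem' would print as Alloc PC */

/* The common allocator that all the wrappers funnel into; it records
 * the call site passed down by the wrapper. */
static void *_malloc(size_t size, void *alloc_pc)
{
    last_alloc_pc = alloc_pc;
    return malloc(size);
}

/* One of the ~20 public malloc variants: pass our own return address,
 * i.e. the caller's call site, down as the Alloc PC. */
void *malloc_wrapper(size_t size)
{
    return _malloc(size, __builtin_return_address(0));
}
```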

I guess finding malloc is just a matter of reading an address provided by IOS!

CreateThread in IOS, identifying the 3rd argument from Lynn’s prototype

I was looking at Michael Lynn’s presentation on IOS exploitation and rootkit development, and thought I’d have a crack at finding some of the functions he documents.  I cheated, because Lynn had already done the work of finding the function prototype, in which he identifies one argument as the process name.  This is enough to find CreateThread.  Simply match the address of the string (search for the byte sequence of a name you get from ‘show proc’ in IOS, e.g. “Chunk Manager”), then match that address against an operand (search for the immediate value).  You might have to search for the upper and lower 16 bits individually, as MIPS sometimes needs two instructions to load a 32-bit address.

In Lynn’s slides, he documents the prototype as:

void *CreateThread(void *entryPoint, char *name, int something, int dunno);

The 3rd argument (something) I identified as the stack size (which can be seen in ‘show proc’).  I had been looking at the stack of each process while trying to work out my call tracer (which I couldn’t get to work), so when I saw the value in CreateThread, I quickly put two and two together.

As for that fourth argument: dunno.  Which brings me to my next point: I really should learn IDA scripting, as putting the entry points for each process into IDA with meaningful names would be very handy.

Getting an accurate call trace of running IOS not so easy

I’ve spent the past little while looking at why my call trace grows but doesn’t subsequently shrink.  I am maintaining a stack of the most recent calls (JAL(R), jump and link (register)).  I pop an element from the stack when the program counter changes to the return address recorded at the time of the JAL(R).  Not perfect, but OK.  The problem is that this naive approach doesn’t work properly: the stack grows without bound.
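The naive single-stack tracer looks roughly like this; it is exactly the version that grows without bound once the scheduler switches processes, because every process really needs its own stack:

```c
#include <stdint.h>

#define MAX_DEPTH 4096
static uint32_t call_stack[MAX_DEPTH];
static int depth;

/* On each JAL/JALR, record where the call will return to.  On MIPS
 * that is pc + 8: past the JAL itself and its branch delay slot. */
void on_jal(uint32_t pc)
{
    if (depth < MAX_DEPTH)
        call_stack[depth++] = pc + 8;
}

/* On each PC change, pop if we have arrived at the saved return
 * address.  When IOS context-switches to another process, the PC
 * jumps somewhere else entirely, nothing matches, and nothing pops. */
void on_pc_change(uint32_t pc)
{
    if (depth > 0 && call_stack[depth - 1] == pc)
        depth--;
}
```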

The reason for this is surely that I’m trying to trace multiple processes, plus the scheduler inside IOS.  I was silly for thinking my naive implementation would work.  According to the Cisco documentation, IOS uses ‘run to completion’ scheduling.  I’m not an expert, but this sounds like an implementation of user threads.

Looking at the call trace I’m generating, I see a group of calls being repeated.  This is presumably what causes the trace to grow without bound.  Looking at the functions in IDA, I can see one of them saving state, including the $s registers, the return address, the frame pointer, and the stack pointer, then restoring some presumably previously stored state and returning to the new (or rather, old) location.

I’ll hazard a guess and call that trampoline function context_switch().  I imagine what happens in IOS is that a process executes its code as per normal, and every now and then calls yield() (or reaches a syscall that blocks waiting on IO), possibly puts the current process on the wait queue, then finds a process on the ready queue and performs a context_switch().

Developing a call trace is a tricky thing, because you really need the state of each process to do it accurately.  On x86, when the frame pointer is used, knowing the stack pointer is enough to build a call trace; but on MIPS it’s not as easy, since the return address is saved in a register and then optionally stored in the local stack frame at an offset that isn’t entirely constant.

It seems like a goal in its own right to develop a call tracer that works in a multi-process/threaded environment, so I am not entirely sure what I will do.