I’ve been trying to figure out what I need to do to implement dynamic binary translation in my emulator.
The most common approach to translating a basic block is to disassemble each instruction and generate an intermediate representation (IR) from it. Once the whole basic block has been converted to IR, perform any reasonable optimisations on it, then generate native code from the optimised IR.
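To make those stages concrete, here is a minimal sketch of the pipeline in Python. Everything in it is made up for illustration: the guest "instructions" are already-decoded tuples rather than raw bytes, the IR has a single add operation, and the emitter just prints assembly-like text. It also ignores flags entirely (a real inc leaves CF alone, so it can't blindly be turned into an add).

```python
# Toy guest block: already-decoded (mnemonic, register) tuples.
# A real front end would disassemble raw guest bytes here.
GUEST_BLOCK = [("inc", "ebx"), ("inc", "ebx"), ("dec", "ecx")]


def lift(block):
    """Stage 1: turn each guest instruction into IR tuples (op, reg, imm)."""
    ir = []
    for mnemonic, reg in block:
        delta = {"inc": 1, "dec": -1}[mnemonic]
        ir.append(("add", reg, delta))
    return ir


def optimise(ir):
    """Stage 2: a peephole pass merging adjacent adds to the same register."""
    out = []
    for op, reg, imm in ir:
        if out and out[-1][0] == "add" and out[-1][1] == reg:
            out[-1] = ("add", reg, out[-1][2] + imm)
        else:
            out.append((op, reg, imm))
    return out


def emit(ir):
    """Stage 3: produce host-like assembly text instead of machine code."""
    return [f"add ${imm},%{reg}" for _, reg, imm in ir]


def translate(block):
    """Run the whole pipeline over one basic block."""
    return emit(optimise(lift(block)))


host_code = translate(GUEST_BLOCK)
```

The point is only the shape: each stage consumes the previous stage's output, so optimisations never need to know about guest encodings or host encodings.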
The IR might use a small number of registers, and these can be statically mapped to host registers. QEMU follows this approach: its IR temporaries live in host registers, while the guest registers themselves are kept in memory. This is probably the simplest approach to register allocation, but I suspect the native code QEMU generates could be greatly improved by keeping guest registers in real host registers and spilling only when necessary.
So when QEMU is faced with the x86 instructions inc %ebx; inc %ebx, it generates IR similar to the following.
mov &context->ebx,%TEMP1
inc %TEMP1
mov %TEMP1, &context->ebx
mov &context->ebx,%TEMP1
inc %TEMP1
mov %TEMP1, &context->ebx
That’s not so bad, but when you consider that the instructions in a basic block often access the same registers over and over, it would be really nice if the generated code looked like this instead:
mov &context->ebx,%ebx
inc %ebx
inc %ebx
mov %ebx, &context->ebx
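One way to get from the first listing towards the second without a full register allocator is to clean up the load/store pairs after the fact. The sketch below is my own toy version, not how QEMU actually does it: the IR is a list of made-up tuples, a forward pass drops reloads of a value the temporary already holds, and a backward pass drops stores that are overwritten later in the block. It assumes nothing in the middle of the block (a fault, a helper call) needs to see up-to-date guest state in memory.

```python
def drop_redundant_loads(ir):
    """Forward pass: skip a load if the temp already mirrors that guest reg."""
    out, holds = [], {}                      # holds: temp -> guest register
    for ins in ir:
        op = ins[0]
        if op == "load":                     # ("load", guest, temp)
            _, guest, temp = ins
            if holds.get(temp) == guest:
                continue                     # value is already in the temp
            holds[temp] = guest
        elif op == "store":                  # ("store", temp, guest)
            _, temp, guest = ins
            # Any other temp mirroring this guest register is now stale.
            holds = {t: g for t, g in holds.items() if g != guest}
            holds[temp] = guest
        else:                                # e.g. ("inc", temp) changes temp
            holds.pop(ins[1], None)
        out.append(ins)
    return out


def drop_dead_stores(ir):
    """Backward pass: skip a store that a later store overwrites unread."""
    out, overwritten = [], set()             # guest regs stored again later
    for ins in reversed(ir):
        op = ins[0]
        if op == "store":
            guest = ins[2]
            if guest in overwritten:
                continue                     # a later store wins anyway
            overwritten.add(guest)
        elif op == "load":
            overwritten.discard(ins[1])      # the earlier store is observed
        out.append(ins)
    out.reverse()
    return out


def optimise(ir):
    return drop_dead_stores(drop_redundant_loads(ir))


# The QEMU-style IR for: inc %ebx; inc %ebx
qemu_style = [("load", "ebx", "T1"), ("inc", "T1"), ("store", "T1", "ebx")] * 2
optimised = optimise(qemu_style)
```

For the two-inc example this leaves one load, two incs and one store, which is the shape of the nicer listing, just with a temporary standing in for the host %ebx.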
Still, the QEMU approach has clear advantages: register allocation problems are practically eliminated, and each host only has to provide the handful of temporaries defined by the IR.
I haven’t looked in great depth at Valgrind, but it seems to take the more efficient approach to register usage and tries to map guest registers to host registers using a register allocation scheme. This is something I have to look into further.
But back to _my_ problem in designing an implementation. Let’s look at the x86 instruction add %eax,10(%ebx,%ecx,4)
If our IR consists of micro-operations that translate the original instruction reasonably faithfully, without the temporaries trick that QEMU uses, the translation could look something like this:
mov %ecx,%temp1
mul 4, %temp1
add %ebx,%temp1
add 10,%temp1
LoadMem %temp1,%temp2
add %eax,%temp2
StoreMem %temp2,%temp1
All that work just to simplify an addressing mode!
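The breakdown of the addressing mode is mechanical, so it can be sketched as a small lowering function. This is only an illustration: the tuple format and the function names are hypothetical, and the temporaries are hard-wired to %temp1 and %temp2 as in the listing above.

```python
def lower_address(base, index, scale, disp, temp="%temp1"):
    """Return micro-ops leaving disp + base + index*scale in `temp`."""
    ops = []
    if index is not None:
        ops.append(("mov", index, temp))
        if scale != 1:
            ops.append(("mul", scale, temp))
        if base is not None:
            ops.append(("add", base, temp))
    elif base is not None:
        ops.append(("mov", base, temp))
    else:
        ops.append(("mov", 0, temp))          # absolute address: disp only
    if disp:
        ops.append(("add", disp, temp))
    return ops


def lower_add_to_memory(src, base, index, scale, disp):
    """Lower `add src, disp(base,index,scale)` into load/modify/store."""
    ops = lower_address(base, index, scale, disp, "%temp1")
    ops.append(("LoadMem", "%temp1", "%temp2"))   # temp2 = [temp1]
    ops.append(("add", src, "%temp2"))            # temp2 += src
    ops.append(("StoreMem", "%temp2", "%temp1"))  # [temp1] = temp2
    return ops


# add %eax,10(%ebx,%ecx,4)
micro_ops = lower_add_to_memory("%eax", "%ebx", "%ecx", 4, 10)
```

Simpler operands produce shorter sequences: with no index register the mul disappears, and with a zero displacement so does the final add.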
As you can see, a problem is that we now require an extra 2 registers! If we went about translating that IR back to x86, we would not be able to statically map all the registers without spillage, since the guest already uses all eight general-purpose registers. I suppose we could always reserve 2 or so registers in the host, which would be a lot easier than a register allocation algorithm.
The above is still a simplification. LoadMem and StoreMem are most likely going to be functions in our software MMU implementation, so this could result in the following:
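To put rough numbers on that, here is a small sketch of a static guest-to-host mapping on a 32-bit x86 host with some registers reserved as temporaries. The register lists and the static_map helper are my own assumptions for illustration; in particular I assume the host %esp stays with the host stack, and I ignore that a real translator would probably also want a host register pointing at the context structure.

```python
# Host registers we could hand out; host esp is left for the host stack.
HOST_REGS = ["eax", "ecx", "edx", "ebx", "ebp", "esi", "edi"]
GUEST_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def static_map(guest_regs, host_regs, n_scratch):
    """Reserve the last n_scratch host registers as temporaries and map
    guest registers onto the rest; leftovers must live in memory."""
    split = len(host_regs) - n_scratch
    available, scratch = host_regs[:split], host_regs[split:]
    in_regs = dict(zip(guest_regs, available))
    in_memory = guest_regs[len(available):]
    return in_regs, in_memory, scratch


in_regs, in_memory, scratch = static_map(GUEST_REGS, HOST_REGS, 2)
```

With two temporaries reserved, only five guest registers fit in host registers and three have to stay in the context structure, so the choice of which guest registers get the fast slots matters.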
mov %ecx,%temp1
mul 4, %temp1
add %ebx,%temp1
add 10,%temp1
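push %eax // save guest eax: LoadMem returns its value in host eax
push %temp1 // argument: the address to load from
call LoadMem
mov %eax,%temp2
add 4,%esp // drop the argument, assuming the caller cleans up the stack
pop %eax // restore guest eax
add %eax,%temp2
push %temp2
push %temp1
call StoreMem
add 8,%esp // drop both arguments, again assuming caller cleanup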
What about the return value of LoadMem? This uses up _another_ register! That’s 3 registers. Maybe another solution is to have LoadMem always hand back the memory contents not in a register, but in memory. This saves us a register, but whether it ends up slower or faster is something I would have to measure.
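As a sketch of that idea, here is a toy software MMU where LoadMem parks its result in a scratch slot of the context instead of returning it. The Context class, the loaded field and the flat little-endian RAM are all assumptions of mine; a real MMU would also do address translation and permission checks.

```python
import struct


class Context:
    """Guest state: flat RAM plus a scratch slot for LoadMem's result."""

    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.loaded = 0  # written by load_mem, read by the generated code


def load_mem(ctx, addr):
    """Read a 32-bit little-endian word and park it in ctx.loaded."""
    (ctx.loaded,) = struct.unpack_from("<I", ctx.ram, addr)


def store_mem(ctx, addr, value):
    """Write the low 32 bits of value to guest memory."""
    struct.pack_into("<I", ctx.ram, addr, value & 0xFFFFFFFF)


def add_to_memory(ctx, addr, src_value):
    """Equivalent of `add src, [addr]` using the scratch slot."""
    load_mem(ctx, addr)
    store_mem(ctx, addr, ctx.loaded + src_value)


ctx = Context(64)
store_mem(ctx, 16, 40)
add_to_memory(ctx, 16, 2)
load_mem(ctx, 16)
```

In generated code the add could then take context->loaded as a memory operand, so no host register is tied up by the return value, at the price of an extra memory access.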
I also had a look at the http://www.zynamics.com IR, which they call REIL (Reverse Engineering Intermediate Language). An interesting aspect is that they have only one conditional instruction, set if boolean zero. I guess they went to some work to rewrite the eflags evaluation (or other status flags/condition codes for non-x86) in terms of it. They clearly took the view that the simplest IR is the best one.
OK. I’ve been rambling for long enough. I highly doubt anyone would find this interesting besides me.
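To convince myself that a single "is zero" test is enough, here is a small sketch of how a few x86-style conditions could be rebuilt from it plus ordinary arithmetic. The function names are mine and this is only loosely modelled on the idea; it is not REIL's actual instruction encoding.

```python
MASK32 = 0xFFFFFFFF


def bisz(value):
    """Return 1 if value is zero, else 0: the only comparison primitive."""
    return 1 if value == 0 else 0


def is_equal(a, b):
    """ZF after a guest `cmp`: the 32-bit difference is zero."""
    return bisz((a - b) & MASK32)


def is_not_equal(a, b):
    """Logical NOT is just another bisz."""
    return bisz(is_equal(a, b))


def is_below(a, b):
    """Unsigned a < b: the borrow (CF) out of a 32-bit subtraction,
    extracted by shifting the wider difference down to bit 32."""
    return ((a - b) >> 32) & 1


def is_below_or_equal(a, b):
    """CF or ZF, built from addition and two bisz tests (no OR needed)."""
    return bisz(bisz(is_below(a, b) + is_equal(a, b)))
```

Each flag becomes a plain 0/1 value computed with arithmetic, so a conditional jump only ever has to test one boolean.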