Monthly Archives: May 2008

Fast string operations, Was x86 CPU bug in rep movsb

UPDATE:  This isn’t a bug after all.  Aspect provided documentation of what is actually occurring.

It’s a feature, present since the Pentium Pro, for performing ‘fast string’ (block) operations.  A block operation (e.g., with rep movsb) moves 64 bytes at once if ecx >= 64, if edi is aligned to an 8-byte boundary, and if esi and edi are not both in the same cache line (a 64-byte block).  Otherwise, it performs single operations.
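The conditions above can be modelled in a few lines. This is only a sketch of the rule as stated in this post, not of what the hardware really does internally; the function names `step_size` and `single_steps` are made up for illustration.

```python
CACHE_LINE = 64  # bytes per cache line, as stated in the post


def step_size(ecx, esi, edi):
    """Bytes moved by one debugger single step, under the rule described above."""
    same_line = esi // CACHE_LINE == edi // CACHE_LINE
    if ecx >= 64 and edi % 8 == 0 and not same_line:
        return 64  # fast string: one 64-byte block operation
    return 1       # otherwise a single movsb


def single_steps(ecx, esi, edi):
    """Count how many single steps a rep movsb takes to run to completion."""
    steps = 0
    while ecx > 0:
        n = step_size(ecx, esi, edi)
        ecx -= n
        esi += n
        edi += n
        steps += 1
    return steps


if __name__ == "__main__":
    # Aligned destination, different cache lines: 65 -> 1 -> 0, i.e. 2 steps.
    print(single_steps(65, 0x1000, 0x2000))
    # Unaligned destination: every byte is a separate step.
    print(single_steps(65, 0x1000, 0x2001))
```

With ecx=65 and an aligned destination, this model takes one 64-byte step followed by one single-byte step, which matches the jump from ecx=65 to ecx=1 described in the original report.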

This seems to have resolved my emulation problems 🙂

While unpacking MEW in my emulator, I came across an interesting bug.  Single stepping through rep movsb with ecx=65 completes the instruction in 2 steps.

movsb copies a byte from the memory pointed to by esi into the memory pointed to by edi.  The rep prefix repeats the movsb ecx times.  It does this by iteratively decrementing the ecx register until it is 0.

On my computer, an old P4, single stepping rep movsb with ecx=65 goes straight from ecx=65 to ecx=1.  This is incorrect (I presume); it should single step through every decrement of ecx.

nemo courteously tested this bug on his own PC, and reported that it single stepped through every decrement of ecx.  This bug is probably specific to my CPU type.

cpu bug, repne changes status flag in scasb

Another CPU bug uncovered while testing my emulator.   I came across a repne scasb while emulating the win32 version of upx.  The logic of scasb (scan string), to paraphrase the Intel manuals, is:


SRC = dereference(edi)
temp = al -  SRC
SetStatusFlags(temp)
update_edi

In the code I ran across, %al was set to 0, the byte at (%edi) was 70 (decimal), and %ecx was large.  Following the operation, the carry flag was cleared.  This is incorrect; the carry flag should be set (0 - 70 sets carry).

I was unsure if my understanding of carry was wrong, so I tried 0 – 70 in a sub.  Carry was set as expected.  scasb’s logic is to perform a temporary subtraction of %al-(%edi) and set the status flags using the temporary result as explained earlier.

When scasb was performed in isolation with the same test case, carry was set.  It seems that adding repne to the scasb changes the carry flag to an incorrect result.
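For comparison, here is a minimal sketch of the documented semantics of scasb and repne scasb, assuming the direction flag is clear and modelling only CF, ZF and SF. The helper names are invented for the example. In this model the flags left behind by repne scasb come from the last comparison performed, not the first, so a scan for al=0 that stops on a zero byte finishes with carry clear even though the first comparison (0 - 70) sets it. Whether that accounts for the test case above is an assumption; it has not been checked against the original binary.

```python
def sub8_flags(a, b):
    """Flags produced by the temporary 8-bit subtraction a - b (CF, ZF, SF only)."""
    a &= 0xFF
    b &= 0xFF
    result = (a - b) & 0xFF
    return {"CF": int(a < b), "ZF": int(result == 0), "SF": result >> 7}


def scasb(mem, edi, al):
    """One scasb: compare al with the byte at edi, then advance edi (DF assumed 0)."""
    flags = sub8_flags(al, mem[edi])
    return edi + 1, flags


def repne_scasb(mem, edi, ecx, al):
    """Repeat scasb while ecx != 0 and the last comparison did not match."""
    flags = None  # flags are left untouched if ecx is already 0
    while ecx != 0:
        edi, flags = scasb(mem, edi, al)
        ecx -= 1
        if flags["ZF"]:
            break
    return edi, ecx, flags


if __name__ == "__main__":
    memory = bytes([70, 1, 0])
    print(scasb(memory, 0, 0))             # single comparison against 70
    print(repne_scasb(memory, 0, 100, 0))  # scan until the zero byte
```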

gdb leaves file descriptors open in debuggee

I have my emulator running reasonably successfully on upx now.  It’s actually an auto unpacker, and identifies when the program is unpacked by monitoring execution on previously written memory.  In the process of emulating file io I came across a particular bug in gdb.

The file descriptor returned from an open call inside the debuggee was 6.  I was expecting 3.

stdin=0, stdout=1,stderr=2

gdb must be using file descriptors 3,4,5, and forgot to close them before calling execve.

I’m not sure what the descriptors are used for.  Anyone care to take a look?

In the best case scenario, this bug can be used as another test to see if a debugger is present, and in the worst case, if these file descriptors were used for control, *gasp* control gdb?  Probably they aren’t used for anything important, but I haven’t looked any further.
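The debugger-presence test mentioned above could look roughly like the following. It is a sketch only: it assumes a POSIX system, the function names are made up, and a non-empty result merely shows that the parent process left descriptors open, which can also happen for reasons that have nothing to do with a debugger (shell redirections, for example).

```python
import os


def open_descriptors(limit=64):
    """Return the descriptors below `limit` that are currently open in this process."""
    fds = []
    for fd in range(limit):
        try:
            os.fstat(fd)  # raises OSError (EBADF) if fd is not open
        except OSError:
            continue
        fds.append(fd)
    return fds


def unexpected_descriptors(limit=64):
    """Descriptors other than stdin, stdout and stderr that were inherited or opened."""
    return [fd for fd in open_descriptors(limit) if fd > 2]


if __name__ == "__main__":
    # Run as early as possible in the process, before opening any files.
    leaked = unexpected_descriptors()
    if leaked:
        print("descriptors already open at startup:", leaked)
    else:
        print("only stdin/stdout/stderr are open")
```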

CPU Bug x86 shl behaviour sets overflow flag

I’ve been writing an x86 emulator, and to debug it, I ran it on a p4 computer in parallel to a debugger on a target program (a upx packed binary).  Well.. I got to shl $8, %eax where eax = 0x00ffffff.

The Intel documentation says that the overflow flag is only changed for 1 bit shifts.  Surprisingly, in the 8 bit shift, the overflow flag became set.  In a 7 bit or 9 bit shift of the same value, the overflow flag remains clear (or perhaps unchanged).
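One model that happens to reproduce all three observations is to apply the documented 1 bit rule (OF = most significant bit of the result XOR CF) to every shift count. The sketch below does exactly that. It is a guess at what this particular CPU does, not documented behaviour, and the function name `shl32` is invented for the example.

```python
MASK32 = 0xFFFFFFFF


def shl32(value, count):
    """Model a 32-bit shl. Returns (result, CF, OF).

    CF is the last bit shifted out, as documented. OF uses the documented
    1 bit rule for EVERY count, which is an assumption made for this sketch.
    """
    count &= 0x1F  # the CPU masks the shift count to 5 bits
    if count == 0:
        return value & MASK32, None, None  # flags are not affected
    wide = (value & MASK32) << count
    result = wide & MASK32
    cf = (wide >> 32) & 1        # last bit shifted out of bit 31
    of = (result >> 31) ^ cf     # MSB(result) XOR CF
    return result, cf, of


if __name__ == "__main__":
    for count in (7, 8, 9):
        result, cf, of = shl32(0x00FFFFFF, count)
        print(f"shl ${count}: result={result:#010x} CF={cf} OF={of}")
```

For eax = 0x00ffffff this gives OF set for the 8 bit shift and clear for the 7 and 9 bit shifts, the same pattern as reported above. That agreement is suggestive rather than conclusive.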

I’ve been googling to see other reports of this undocumented behaviour, but either it’s not out there, or more likely my googling skills are poor.  I couldn’t find a reference.

Anyone got more information on this?

[Update:  I have had reports from one person which said the behavior varied between setting and clearing the flag depending on the cpu.]

Merging basic blocks to deobfuscate non contiguous control flow

In some binaries, basic blocks may be connected only by jumps.  These basic blocks may also be non contiguous in the file, i.e. scattered throughout the binary.

In cases like this, if you’re looking at the disassembly, you need to constantly jump around the image to follow the logical order of the control flow.  When the control flow is graphed, it appears logically linear, but when reading the code, it sometimes helps to go for the older text dump of the disassembly.

The way I implemented this was to construct a control flow graph of each procedure, then merge each basic block with its predecessor iff it has exactly one predecessor and that predecessor has exactly one successor (the basic block we are looking at merging).  To dump the disassembly, a recursive approach is taken for each basic block: dump the assembly of the current basic block, then the next linear basic block (applied recursively), then the branched basic block (if it exists, also applied recursively).
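The merge rule can be sketched as follows. This is an illustration rather than the code from my disassembler: the graph representation (a dict of instruction lists plus a dict of successor lists) and the name `merge_blocks` are assumptions made for the example, and the trailing jmp of each merged block is left in place.

```python
def merge_blocks(blocks, succs):
    """Merge each block into its predecessor when it has exactly one
    predecessor and that predecessor has exactly one successor.

    blocks: dict mapping block name -> list of instructions
    succs:  dict mapping block name -> list of successor block names
    Returns new (blocks, succs) dicts; the inputs are not modified.
    """
    blocks = {name: list(insns) for name, insns in blocks.items()}
    succs = {name: list(succs.get(name, [])) for name in blocks}
    changed = True
    while changed:
        changed = False
        # Recompute predecessor lists from the current successor lists.
        preds = {name: [] for name in blocks}
        for src, dsts in succs.items():
            for dst in dsts:
                preds[dst].append(src)
        for name in list(blocks):
            if len(preds[name]) != 1:
                continue
            pred = preds[name][0]
            if pred == name or succs[pred] != [name]:
                continue
            # Append the block to its predecessor and inherit its successors.
            blocks[pred].extend(blocks.pop(name))
            succs[pred] = succs.pop(name)
            changed = True
            break  # predecessor lists are stale now, so recompute them
    return blocks, succs


if __name__ == "__main__":
    blocks = {"A": ["push ebp", "jmp C"], "C": ["mov eax, 1", "jmp B"], "B": ["ret"]}
    succs = {"A": ["C"], "C": ["B"], "B": []}
    merged, _ = merge_blocks(blocks, succs)
    for name, insns in merged.items():
        print(name, insns)
```

Recomputing the predecessor lists after every merge is wasteful but keeps the sketch short; a real implementation would update them incrementally.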

I made these improvements to my disassembler, so it prints the disassembly in logical order, following the jumps.  In at least one piece of malware out of a sample of about ten, this deobfuscation proved successful, and over 800 basic blocks were merged in an object with around 14000 instructions.  The malware samples I’ve been using have come from http://www.offensivecomputing.net/

I’m in the process of looking at more malware samples to see how common this type of obfuscation is.  If anyone can suggest names of malware samples for me to look at and run my disassembler on, that would be great.

Probably more useful than the deobfuscator I’ve described is an automatic unpacker.  Most of the malware is packed, and in fact, the disassembly is non trivial since indirect jumps and calls seem common.  This might be something that I will work on in the future.

In at least one other malware sample I have, dead code is common.  That is, registers are assigned, modified, then reassigned new values (without making any further use of the original values), making the older assignments dead.  I would like to automate removing this, and liveness analysis should be able to identify these cases; however, I have yet to implement dataflow analysis in my disassembler.
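As a rough idea of what the liveness pass would look like within a single basic block, here is a sketch. It is not implemented in the disassembler; the instruction representation (a destination register plus a list of source registers) and the name `dead_assignments` are hypothetical, and it ignores flags, memory writes and other side effects that a real pass would have to respect.

```python
def dead_assignments(insns, live_out):
    """Find instructions whose destination is overwritten before being read.

    insns:    list of (dest, sources) tuples in program order; dest may be None
              for instructions that write no register
    live_out: registers assumed to be live at the end of the block
    Returns the sorted indices of dead instructions.
    """
    live = set(live_out)
    dead = []
    # Walk backwards, tracking which registers are still needed later on.
    for index in range(len(insns) - 1, -1, -1):
        dest, sources = insns[index]
        if dest is not None and dest not in live:
            dead.append(index)
            continue  # a dead instruction does not keep its sources alive
        live.discard(dest)
        live.update(sources)
    return sorted(dead)


if __name__ == "__main__":
    block = [
        ("eax", []),        # 0: mov eax, 5
        ("eax", ["eax"]),   # 1: add eax, 1
        ("eax", ["ebx"]),   # 2: mov eax, ebx   (older eax values are now dead)
        ("ecx", ["eax"]),   # 3: mov ecx, eax
    ]
    print(dead_assignments(block, live_out={"ecx"}))
```

Extending this across basic blocks needs the usual iterative dataflow over the control flow graph, which is the part that is still missing.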