[plt-scheme] Ocaml or scheme for decompiler?

From: Jim Witte (jswitte at bloomington.in.us)
Date: Fri Aug 6 20:32:15 EDT 2004

   Which language would people recommend to write a C++ decompler for 
ARM code for a specific processor with a specific compiler (which I 
have)?

   Thing's it would have to deal with in the code (which has symbolic 
information for parameters, which makes type inference):

1.  parameter track through registers and stack locations (parameters 
0-3, including object struct address are in r0-r3, others are on 
stack),

2. memory address loading.  Things like

00:  ldr r0, &0x06
04: ldr r0, [r0]
06: dcd 0x FF1E2C

where the code at 06 is a constant and therefore usually turns into an 
assembly language instruction that doesn't make much sense.

3. C-structure offsets:

ldr r1, #0
ldr r2, #4
str r2, [r0,#0]
str r1, [r0, #1]
stmia r0, {r1,r2}

This will store in the structure pointed to by r0 the numbers (bytes) 
{4, 1, 1 2}.  Things get more interesting than this, as in the function

ConvertCodecBlock(SoundBlock*, CodecBlock*)
001E5BEC mov   r2, #0 (0)
001E5BF0 str      r2, [r1]

001E5BF4 ldr      r2, [r0]
001E5BF8 str      r2, [r1, #4 (4)]

001E5BFC ldr      r2, [r0, #4 (4)]
001E5C00 str      r2, [r1, #8 (8)]

001E5C04 ldr      r2, [r0, #8 (8)]
001E5C08 str      r2, [r1, #12 (C)]

001E5C0C ldr      r2, [r0, #12 (C)]
001E5C10 str      r2, [r1, #16 (10)]

001E5C14 ldr      r2, [r0, #16 (10)]             	
001E5C18 str      r2, [r1, #20 (14)]

001E5C1C ldr      r0, [r0, #36 (24)]
001E5C20 str      r0, [r1, #24 (18)]!
001E5C24 mov      pc, lr

   This takes fields from one structure and stuffs them into another.  
The "!" increments the base registers - no one knows why the compiler 
does this..

4. branch tables:
add pc, r0, asr #2
b <function 1>
b <function 2>
b <function 3>
b <function 4>

There's also one variation on this idea, which does something like

ldr r0, <address>
ldr r0, [r0]	;; dereference
add r2, r0, #54	;; add 54 to the r0 pointer
add pc, pc, r2	;; jump to location in that branch table

and probably are others I haven't seen enough to characterize off the 
top of my head.  There are also *very* strange things like functions 
that jump *inside* other functions (this was a hack, I'm told)

5. Structure allocation.  Things like

sub sp, sp #50
mov r1, #50
ldr r0,sp
bl _new_

stack allocates a 50-byte structure below the current stack pointer, 
with the stack pointer at the end pointing add it.  Sometimes two 
allocations are combined into one new:

mov r0, sp
mov r1, #64
bl __new__
bl _Event constructor__ (for a 32 byte structure)
add r0, r0, #32
bl _Event constructor__ (for another event structure)

and would probably decompile to something like

Event r1 = new Event;
Event r2 = new Event;

(the constructor code contains the call to new, so the above wouldn't 
happen - but I've seen things similar to it where there's only one new, 
but multiple structures are used.)

6. Translating arithmetic staments

r0 = r1 * 7 turns into

add r1, r0 asl #2	;; r1=r0 + r0*4
add r0, r0 asl #1	;; r0 =r0 + r0*2
add r0, r0, r1		;; r0 = r0 + r0*2 + (r0+ r0*4) = r0*6

that's just one of the simpler things..

   Ideally, what would be nice is if I could somehow process the code 
(about 40-100MB of text) to get translate the patterns it could, and 
then create a list of recurring patterns that it couldn't.  Then I 
could look at the list of patterns it couldn't translate, figure out 
what they were, and write another pass to the decompiler and run it 
again..  (and perhaps several more times after that)

Jim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3626 bytes
Desc: not available
URL: <http://lists.racket-lang.org/users/archive/attachments/20040806/b9eba00d/attachment.bin>

Posted on the users mailing list.