[plt-scheme] Ocaml or scheme for decompiler?
Which language would people recommend to write a C++ decompler for
ARM code for a specific processor with a specific compiler (which I
have)?
Thing's it would have to deal with in the code (which has symbolic
information for parameters, which makes type inference):
1. parameter track through registers and stack locations (parameters
0-3, including object struct address are in r0-r3, others are on
stack),
2. memory address loading. Things like
00: ldr r0, &0x06
04: ldr r0, [r0]
06: dcd 0x FF1E2C
where the code at 06 is a constant and therefore usually turns into an
assembly language instruction that doesn't make much sense.
3. C-structure offsets:
ldr r1, #0
ldr r2, #4
str r2, [r0,#0]
str r1, [r0, #1]
stmia r0, {r1,r2}
This will store in the structure pointed to by r0 the numbers (bytes)
{4, 1, 1 2}. Things get more interesting than this, as in the function
ConvertCodecBlock(SoundBlock*, CodecBlock*)
001E5BEC mov r2, #0 (0)
001E5BF0 str r2, [r1]
001E5BF4 ldr r2, [r0]
001E5BF8 str r2, [r1, #4 (4)]
001E5BFC ldr r2, [r0, #4 (4)]
001E5C00 str r2, [r1, #8 (8)]
001E5C04 ldr r2, [r0, #8 (8)]
001E5C08 str r2, [r1, #12 (C)]
001E5C0C ldr r2, [r0, #12 (C)]
001E5C10 str r2, [r1, #16 (10)]
001E5C14 ldr r2, [r0, #16 (10)]
001E5C18 str r2, [r1, #20 (14)]
001E5C1C ldr r0, [r0, #36 (24)]
001E5C20 str r0, [r1, #24 (18)]!
001E5C24 mov pc, lr
This takes fields from one structure and stuffs them into another.
The "!" increments the base registers - no one knows why the compiler
does this..
4. branch tables:
add pc, r0, asr #2
b <function 1>
b <function 2>
b <function 3>
b <function 4>
There's also one variation on this idea, which does something like
ldr r0, <address>
ldr r0, [r0] ;; dereference
add r2, r0, #54 ;; add 54 to the r0 pointer
add pc, pc, r2 ;; jump to location in that branch table
and probably are others I haven't seen enough to characterize off the
top of my head. There are also *very* strange things like functions
that jump *inside* other functions (this was a hack, I'm told)
5. Structure allocation. Things like
sub sp, sp #50
mov r1, #50
ldr r0,sp
bl _new_
stack allocates a 50-byte structure below the current stack pointer,
with the stack pointer at the end pointing add it. Sometimes two
allocations are combined into one new:
mov r0, sp
mov r1, #64
bl __new__
bl _Event constructor__ (for a 32 byte structure)
add r0, r0, #32
bl _Event constructor__ (for another event structure)
and would probably decompile to something like
Event r1 = new Event;
Event r2 = new Event;
(the constructor code contains the call to new, so the above wouldn't
happen - but I've seen things similar to it where there's only one new,
but multiple structures are used.)
6. Translating arithmetic staments
r0 = r1 * 7 turns into
add r1, r0 asl #2 ;; r1=r0 + r0*4
add r0, r0 asl #1 ;; r0 =r0 + r0*2
add r0, r0, r1 ;; r0 = r0 + r0*2 + (r0+ r0*4) = r0*6
that's just one of the simpler things..
Ideally, what would be nice is if I could somehow process the code
(about 40-100MB of text) to get translate the patterns it could, and
then create a list of recurring patterns that it couldn't. Then I
could look at the list of patterns it couldn't translate, figure out
what they were, and write another pass to the decompiler and run it
again.. (and perhaps several more times after that)
Jim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3626 bytes
Desc: not available
URL: <http://lists.racket-lang.org/users/archive/attachments/20040806/b9eba00d/attachment.bin>