[racket] SIGSEGV MAPERR si_code 1 fault on addr 0x7...; can't isolate or consistently reproduce in source code; stack trace points to scheme_uncopy_stack

From: Matthew Eric Bassett (mebassett at gegn.net)
Date: Thu May 2 13:42:56 EDT 2013

Hi all,

It might be better to send this to dev at racket-lang.  Then again, it 
might be completely useless to them.

So we have a job scheduler program written in Racket that manages 
various places and TCP clients.  This program sporadically and 
unpredictably terminates with the following error message:

SIGSEGV MAPERR si_code 1 fault on addr 0x7fffb044ef48

We've caught the error at various points of execution, but we 
can't consistently reproduce it (yet).  We do have a core dump from 
a run in the Racket 5.3.3 REPL (with the program loaded via 
"enter!").  I've made it available via Dropbox at 
https://www.dropbox.com/s/rkd6pl511acll2r/core.12346.gz.

I don't have much experience reading core dumps, but it looks to me 
like Racket is hitting a stack overflow in scheme_uncopy_stack in 
setjmpup.c.

In particular, at the first scheme_uncopy_stack frame in the 
backtrace, called as scheme_uncopy_stack (ok=0, b=0x7f857b2b3510, 
prev=0x7fffb044f5e0), we have:

>>(gdb) p prev
$1 = (intptr_t *) 0x7fffb044f5e0
>>(gdb) p *prev
$2 = 0
>>(gdb) p b
$3 = (Scheme_Jumpup_Buf *) 0x7f857b2b3510
>>(gdb) p *b
$4 = {stack_from = 0x0, stack_copy = 0x0, stack_size = 0,
  stack_max_size = 0, cont = 0x0, buf = {jb = {{jb = {{
          __jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0,
          __saved_mask = {__val = {0 <repeats 16 times>}}}},
        stack_frame = 0}}, gcvs = 0, gcvs_cnt = 0}, gc_var_stack = 0x0,
  external_stack = 0x0}

scheme_uncopy_stack keeps recurring in the backtrace for several 
thousand frames.

The Racket interpreter was compiled from source (so I don't know 
whether others can even read that core dump!) on Linux 
3.4.37-40.44.amzn1.x86_64 #1 SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 
x86_64 x86_64 GNU/Linux, with glibc and glibc-devel 
glibc-2.12-1.107.43.amzn1.x86_64.

I was also able to capture the same error running the Racket 5.3.3 
Mac OS X binaries from racket-lang.org.  That core dump is available at 
https://www.dropbox.com/s/jfneqr4zlkmkjhh/core.41166.gz.


Is this a bug in Racket?  If not, do you have any suggestions for how 
to proceed with debugging this (I'm at a loss), or even for figuring 
out which parts of our Racket code to look at?  (I've tried running 
"info locals" in gdb at various points in the stack, but I haven't 
reached enlightenment.  Again, I have little experience reading core 
dumps.)

I haven't included any of our Racket code, since we don't know which 
part is causing the problem.

Thanks for reading,

--
Matthew Eric Bassett | http://mebassett.info

Posted on the users mailing list.