[racket] debugging core dump - comments appreciated

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Mon May 23 22:47:55 EDT 2011

At Mon, 23 May 2011 22:26:20 -0400, Neil Van Dyke wrote:
> Matthew Flatt wrote at 05/23/2011 10:11 PM:
> > At Mon, 23 May 2011 22:01:31 -0400, Neil Van Dyke wrote:
> >   
> >> We're not explicitly setting any stack limits anywhere.  I believe but 
> >> am not certain that that core dump came from a "mzscheme -jqr" from 
> >> inside an Apache CGI context that got a native stack ulimit of 8192 kB 
> >> (the normal limit on that machine).  Shall I confirm this?
> >>     
> >
> > Maybe, but I've become more interested in the possibility that other OS
> > threads might have crashed. Does `info threads' work in gdb with a core
> > file?
> >   
> 
> I'm not certain "gdb" is accurate here, but I don't think that any C 
> code we use introduces any additional OS threads.
> 
> #0  0x00000000005655b6 in GC_clear_stack_inner (arg=0x0, 
> limit=0x7fff2dd5ce30 <Address 0x7fff2dd5ce30 out of bounds>) at ./misc.c:243
> 243    ./misc.c: No such file or directory.
>     in ./misc.c
> (gdb) info threads
>   2 process 28526  0x00007fff316fcbe1 in nanosleep () from /lib/libc.so.6
> * 1 process 28525  0x00000000005655b6 in GC_clear_stack_inner (arg=0x0, 
> limit=0x7fff2dd5ce30 <Address 0x7fff2dd5ce30 out of bounds>) at ./misc.c:243

That looks right. The nanosleep() thread is there to trigger a
Racket-thread switch every 100ms or so, but it's apparently not
crashing in the attempt.

> >> Could code evaluated at module load time, such as "make-standard-set" 
> >> (which has some non-tail calls in loops, I don't know the size), be 
> >> using lots of stack, and, once every 100,000 runs of a large program, 
> >> combines with nondeterministic GC behavior and a bug to cause a seg fault?
> >>     
> >
> > It seems unlikely that any module is using lots of C stack relative to
> > 8MB, so I think we must be missing something simpler. Nondeterministic
> > GC behavior seems like a likely part of the puzzle, though.
> >   
> 
> (I'm not sure whether we're talking about a Scheme stack that is 
> different than the native stack)  Could we be having an overly large 
> stack quite often, and the rareness of the crash is only because usually 
> the stack does not collide with non-stack memory in a detectable way?

Neither the C stack nor the Scheme stack (yes, they are separate) seems
particularly large. There's one overflow of the Scheme stack, but
that's not surprising, since it starts small and grows on demand.


I guess we're back to checking on the stack size. Maybe also
disassemble GC_clear_stack_inner() so we can be clear on what
part of the function is crashing?

Thanks,
Matthew



Posted on the users mailing list.