[racket] debugging core dump - comments appreciated
At Mon, 23 May 2011 22:26:20 -0400, Neil Van Dyke wrote:
> Matthew Flatt wrote at 05/23/2011 10:11 PM:
> > At Mon, 23 May 2011 22:01:31 -0400, Neil Van Dyke wrote:
> >
> >> We're not explicitly setting any stack limits anywhere. I believe but
> >> am not certain that that core dump came from a "mzscheme -jqr" from
> >> inside an Apache CGI context that got a native stack ulimit of 8192 kB
> >> (the normal limit on that machine). Shall I confirm this?
> >>
> >
> > Maybe, but I've become more interested in the possibility that other OS
> > threads might have crashed. Does `info threads' work in gdb with a core
> > file?
> >
>
> I'm not certain "gdb" is accurate here, but I don't think that any C
> code we use introduces any additional OS threads.
>
> #0 0x00000000005655b6 in GC_clear_stack_inner (arg=0x0,
> limit=0x7fff2dd5ce30 <Address 0x7fff2dd5ce30 out of bounds>) at ./misc.c:243
> 243 ./misc.c: No such file or directory.
> in ./misc.c
> (gdb) info threads
> 2 process 28526 0x00007fff316fcbe1 in nanosleep () from /lib/libc.so.6
> * 1 process 28525 0x00000000005655b6 in GC_clear_stack_inner (arg=0x0,
> limit=0x7fff2dd5ce30 <Address 0x7fff2dd5ce30 out of bounds>) at ./misc.c:243
That looks right. The nanosleep() thread is there to trigger a
Racket-thread switch every 100ms or so, but it's apparently not
crashing in the attempt.
> >> Could code evaluated at module load time, such as "make-standard-set"
> >> (which has some non-tail calls in loops, I don't know the size), be
> >> using lots of stack, and, once every 100,000 runs of a large program,
> >> combines with nondeterministic GC behavior and a bug to cause a seg fault?
> >>
> >
> > It seems unlikely that any module is using lots of C stack relative to
> > 8MB, so I think we must be missing something simpler. Nondeterministic
> > GC behavior seems like a likely part of the puzzle, though.
> >
>
> (I'm not sure whether we're talking about a Scheme stack that is
> different than the native stack) Could we be having an overly large
> stack quite often, and the rareness of the crash is only because usually
> the stack does not collide with non-stack memory in a detectable way?
Neither the C stack nor the Scheme stack (yes, they are separate) seems
particularly large. There's one overflow of the Scheme stack, but
that's not surprising since it starts small and grows on demand.
I guess we're back to checking on the stack size. Maybe also
disassemble GC_clear_stack_inner() so we can be clear on what
part of the function is crashing?
Thanks,
Matthew