[plt-scheme] Interesting Seg-Fault Interfacing With C Code
Hello,
Is it possible for a GC to run while in the dynamic extent of a C
extension without a call to SCHEME_USE_FUEL(...), scheme_malloc_...(),
scheme_...(), etc., under 3m? I had a seg-fault that appeared to be a
3m GC problem in C code interfacing with MzScheme (I have now fixed
the bug, as I explain below), and one possible explanation is that
some objects were being moved "between" calls into the runtime.
The code I had before fixing the bug looked like this:

    void do_something() {
        /* Register stuff for GC. */
        /* Manipulate stuff a lot, calling SCHEME_USE_FUEL(1) occasionally. */
        qsort(stuff, stuff_len, sizeof(stuff_t),
              function_which_does_not_call_scheme);
        /* Manipulate stuff some more, more SCHEME_USE_FUEL(1), etc. */
        /* Un-register stuff for GC. */
    }

and I would get seg-faults (within an hour of CPU time, but after
*many* successful trips through the code), always inside qsort() (the
C library function). This was suspicious to me, because qsort was the
only part of my code that was not instrumented by XForm.
Looking in GDB, the segfaults were due to following pointers outside
allocated memory, exactly as if the memory for "stuff" had been moved
by the GC during the extent of the qsort call. However, accessing
"stuff" immediately before and after qsort caused no problem (i.e.
between the last SCHEME_USE_FUEL(...) call and qsort, I had valid
data). My impression was that "stuff" could only move when there was
a call back into the runtime, not at random intervals, and certainly
not in the dynamic extent of qsort() (as long as the comparison
function was careful not to call into the runtime, and mine didn't).
This situation occurred independently of whether I was manually
marking pointers with the MZ_GC_REG() machinery, or letting XForm do
it for me.
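
For concreteness, here is a minimal sketch of the kind of qsort() call
described above. The layout of stuff_t, the field names, and the
helper sort_stuff() are all hypothetical (the real definitions are not
shown in this message). The point is only that the comparison function
follows pointers stored in the elements and never calls into the
runtime, so a fault inside it would mean a pointer it read no longer
referred to valid memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element layout; the real stuff_t is not shown. */
typedef struct {
    const char *key;   /* pointer that the comparison has to follow */
    int payload;
} stuff_t;

/* Comparison function that makes no calls into the Scheme runtime.
   It only reads through the pointers stored in the two elements. */
static int compare_stuff(const void *a, const void *b)
{
    const stuff_t *sa = (const stuff_t *)a;
    const stuff_t *sb = (const stuff_t *)b;
    return strcmp(sa->key, sb->key);
}

/* Sort the array in place with the C library's qsort(). */
static void sort_stuff(stuff_t *stuff, size_t stuff_len)
{
    assert(stuff != NULL || stuff_len == 0);
    qsort(stuff, stuff_len, sizeof(stuff_t), compare_stuff);
}
```

In plain C, outside any collector, this sorts the elements by key; the
sketch says nothing about how the 3m collector treats memory that
qsort() is working on.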
To fix the bug, I simply copied some sort code from the GSL into my
library (yes, my code will, if released, be covered by the GPL) so
that XForm can operate on it, too. So now I call gsl_heapsort(...)
instead of qsort(...), where the code for gsl_heapsort(...) has been
XForm-ed.
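
To illustrate the shape of that replacement, here is a small heapsort
with a qsort()-style interface that could be compiled as part of an
extension's own sources. This is not the GSL code: local_heapsort,
compare_fn, and compare_ints are hypothetical names, and the sketch
only shows the technique (an in-place heapsort that takes the same
array, count, size, and comparison arguments as qsort()).

```c
#include <assert.h>
#include <stddef.h>

typedef int (*compare_fn)(const void *, const void *);

/* Swap two distinct elements of `size` bytes, one byte at a time. */
static void swap_bytes(char *a, char *b, size_t size)
{
    while (size-- > 0) {
        char tmp = *a;
        *a++ = *b;
        *b++ = tmp;
    }
}

/* Restore the max-heap property below `root`, looking only at the
   first `count` elements of the array. */
static void sift_down(char *base, size_t root, size_t count,
                      size_t size, compare_fn compare)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= count)
            return;
        /* Pick the larger of the two children, if there are two. */
        if (child + 1 < count &&
            compare(base + child * size, base + (child + 1) * size) < 0)
            child++;
        if (compare(base + root * size, base + child * size) >= 0)
            return;
        swap_bytes(base + root * size, base + child * size, size);
        root = child;
    }
}

/* In-place heapsort taking the same arguments as qsort(). */
void local_heapsort(void *array, size_t count, size_t size,
                    compare_fn compare)
{
    char *base = (char *)array;
    size_t i;

    assert(compare != NULL);
    if (count < 2)
        return;
    /* Build a max-heap, starting from the last parent node. */
    for (i = count / 2; i-- > 0; )
        sift_down(base, i, count, size, compare);
    /* Move the current maximum to the end, then shrink the heap. */
    for (i = count - 1; i > 0; i--) {
        swap_bytes(base, base + i * size, size);
        sift_down(base, 0, i, size, compare);
    }
}

/* Example comparison function for an array of int. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

A call such as local_heapsort(v, n, sizeof(int), compare_ints) takes
the place of the corresponding qsort() call. Unlike qsort(), heapsort
needs no recursion or auxiliary buffer, which keeps the routine small
enough to carry around with the extension.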
Could it be possible that, on my platform (Mac OS X 10.5, x86, 2
CPUs), something is going wrong with scheduling so that a GC could run
even while I was in qsort(...), far away from any calls into the
runtime? Maybe one of the worker OS threads that MzScheme starts
doesn't stay asleep like it should while my C code is running, and
accidentally lets the GC run? It is very strange that XForm
instrumenting the sort code seems to have fixed my problem. (I
understand that these memory bugs are hard to test for, but the code
has now been running successfully for three hours inside a MzScheme
instance that calls (collect-garbage) in a separate thread every
second---that looks like a "fix" to me.)
Does anybody have any thoughts on this? Of course, since the bug is
currently fixed, there is nothing urgent about this message, but I
would like to know what the experts think about my situation.
Thanks,
Will