[plt-scheme] Interesting Seg-Fault Interfacing With C Code

From: Will Farr (farr at MIT.EDU)
Date: Mon Mar 23 23:02:35 EDT 2009

Hello,

Under 3m, is it possible for a GC to run while in the dynamic extent of a C
extension when there is no call to SCHEME_USE_FUEL(...), scheme_malloc_...(),
scheme_...(), etc.?  I had what appeared to be a seg-fault caused by a 3m GC
problem while interfacing with C code (I have now fixed the bug, as I will
explain below), and one possible explanation is that some objects were being
moved "between" calls into the runtime.

The code I had before fixing the bug looked like this:

void do_something() {
   /* Register stuff for GC. */

   /* Manipulate stuff a lot, calling SCHEME_USE_FUEL(1) occasionally. */

   qsort(stuff, stuff_len, sizeof(stuff_t),
         function_which_does_not_call_scheme);

   /* Manipulate stuff some more, more SCHEME_USE_FUEL(1), etc. */

   /* Un-register stuff for GC. */
}

and I would get seg-faults (within an hour of CPU time, but after  
*many* successful trips through the code), always inside qsort() (the  
C library function).  This was suspicious to me, because qsort was the  
only part of my code that was not instrumented by XForm.

Looking in GDB, the segfaults were due to following pointers outside  
allocated memory, exactly as if the memory for "stuff" had been moved  
by the GC during the extent of the qsort call.  However, accessing  
"stuff" immediately before and after qsort caused no problem (i.e.  
between the last SCHEME_USE_FUEL(...) call and qsort, I had valid  
data).  My impression was that "stuff" could only move when there was  
a call back into the runtime, not at random intervals, and certainly  
not in the dynamic extent of qsort() (as long as the comparison  
function was careful not to call into the runtime, and mine didn't).   
This situation occurred independently of whether I was manually  
marking pointers with the MZ_GC_REG() machinery, or letting XForm do  
it for me.

To fix the bug, I simply included some sort code I lifted out of the  
GSL (yes, my code will, if released, be covered by the GPL) in the  
library so that XForm can operate on it, too.  So now I call  
gsl_heapsort(...) instead of qsort(...), where the code for  
gsl_heapsort(...) has been XForm-ed.
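
For readers who want to try the same workaround, here is a minimal sketch of a
sort routine with a qsort-style signature that can live in the extension's own
source file, so that whatever preprocessing is applied to the rest of the
extension also sees the sort.  This is not the GSL code: the names
local_heapsort, sift_down, swap_bytes and compare_ints are hypothetical and
exist only for illustration, and the sketch contains no MzScheme calls at all.

```c
#include <assert.h>
#include <stddef.h>

/* Same comparison contract as qsort(): negative, zero, or positive. */
typedef int (*cmp_fn)(const void *, const void *);

/* Swap two distinct elements of `size` bytes each, one byte at a time. */
static void swap_bytes(char *a, char *b, size_t size) {
    while (size-- > 0) {
        char tmp = *a;
        *a++ = *b;
        *b++ = tmp;
    }
}

/* Restore the max-heap property for the subtree rooted at `root`,
   looking only at elements with index < n. */
static void sift_down(char *base, size_t root, size_t n, size_t size,
                      cmp_fn cmp) {
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n)
            break;
        /* Pick the larger of the two children, if there are two. */
        if (child + 1 < n &&
            cmp(base + child * size, base + (child + 1) * size) < 0)
            child++;
        if (cmp(base + root * size, base + child * size) >= 0)
            break;
        swap_bytes(base + root * size, base + child * size, size);
        root = child;
    }
}

/* In-place heapsort (hypothetical name); arguments mirror qsort(). */
void local_heapsort(void *data, size_t count, size_t size, cmp_fn cmp) {
    char *base = (char *)data;
    size_t i;

    assert(size > 0);
    if (count < 2)
        return;

    /* Build a max-heap, starting from the last parent node. */
    for (i = count / 2; i-- > 0; )
        sift_down(base, i, count, size, cmp);

    /* Repeatedly move the current maximum to the end of the array. */
    for (i = count - 1; i > 0; i--) {
        swap_bytes(base, base + i * size, size);
        sift_down(base, 0, i, size, cmp);
    }
}

/* Example comparison function that touches nothing but its arguments. */
int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

A call then has the same shape as the qsort call above, e.g.
local_heapsort(stuff, stuff_len, sizeof(stuff_t), function_which_does_not_call_scheme).
Heapsort sorts in place and needs no allocation, which keeps the routine free
of any calls outside the file.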

Could it be possible that, on my platform (Mac OS X 10.5, x86, 2  
CPUs), something is going wrong with scheduling so that a GC could run  
even while I was in qsort(...), far away from any calls into the  
runtime?  Maybe one of the worker OS threads that MzScheme starts  
doesn't stay asleep like it should while my C code is running, and  
accidentally lets the GC run?  It is very strange that XForm  
instrumenting the sort code seems to have fixed my problem.  (I  
understand that these memory bugs are hard to test for, but the code  
has now been running successfully for three hours inside a MzScheme  
instance that calls (collect-garbage) in a separate thread every  
second---that looks like a "fix" to me.)

Does anybody have any thoughts on this?  Of course, since the bug is  
currently fixed, there is nothing urgent about this message, but I  
would like to know what the experts think about my situation.

Thanks,
Will


Posted on the users mailing list.