[racket] very general reliability of old stuff question

From: Neil Van Dyke (neil at neilvandyke.org)
Date: Fri May 20 10:00:56 EDT 2011

If someone came to you and said, "We're using PLT 4.2.5 with CGC and 
JIT, and we are wondering whether reliability would be improved by 
moving to Racket 5.x and/or moving to 3m and/or disabling 4.2.5's JIT," 
what would you say?

Details... A big installation of PLT 4.2.5 (with CGC, and with JIT 
enabled) has noticed a rare unexplained crash of the app.  This is less 
than 100.0000% reliability, which bothers us more than it would most 
organizations.  The app does still use old-style CGC C extensions to call 
one C library.  The C library itself is widely used in industry, and is 
not suspect.  It's possible that the C extensions are doing something 
wrong, although they have seemed solid for high volume for years, and 
(though I did not write them myself) they seem to me to be doing the 
right things for GC safety.  It's also possible that the Scheme or C 
code of the app is not handling all the conditions of the library 
properly, and on rare occasions will then use the library in an 
invalid way, such as with a bad pointer or causing a vomit on the heap 
or stack.  This has occurred on multiple boring Linux servers, so 
hardware is not suspect.  We have not ruled out the possibility of a 
freak bug in PLT.

We have set up core dumps and instrumented much of the code for detailed 
logging, and are attempting to stimulate the rare crash in a test 
environment.  We have also started some new rigorous analysis of the 
bits of C code.  But we're also wondering whether there are known 
instability problems with the older PLT stuff we're using, and if we'd 
be better off, *stability-wise*, moving to Racket 5.x, moving to 3m 
(which probably means using FFI for our library, or replacing it with 
pure Scheme/Racket code), or disabling the 4.2.5 JIT.


Posted on the users mailing list.