[racket] very general reliability of old stuff question
If someone came to you and said, "We're using PLT 4.2.5 with CGC and
JIT, and we are wondering whether reliability would be improved by
moving to Racket 5.x and/or moving to 3m and/or disabling 4.2.5's JIT,"
what would you say?
Details... A big installation of PLT 4.2.5 (with CGC, and with JIT
enabled) has noticed a rare unexplained crash of the app. This is less
than 100.0000% reliability, which bothers us more than it would most
organizations. The app does still use old-style CGC C extensions to call
one C library. The C library itself is widely used in industry, and is
not suspect. It's possible that the C extensions are doing something
wrong, although they have seemed solid for high volume for years, and
(though I did not write them myself) they seem to me to be doing the
right things for GC safety. It's also possible that the Scheme or C
code of the app is not handling all the conditions of the library
properly, and on rare occasions will then use the library in an
invalid way, such as with a bad pointer or causing a vomit on the heap
or stack. This has occurred on multiple boring Linux servers, so
hardware is not suspect. We have not ruled out the possibility of a
freak bug in PLT.
We have set up core dumps and instrumented much of the code for detailed
logging, and we are attempting to stimulate the rare crash in a test
environment. We have also started some new rigorous analysis of the
bits of C code. But we're also wondering whether there are known
instability problems with the older PLT stuff we're using, and if we'd
be better off, *stability-wise*, moving to Racket 5.x, moving to 3m
(which probably means using FFI for our library, or replacing it with
pure Scheme/Racket code), or disabling the 4.2.5 JIT.
--
http://www.neilvandyke.org/