[racket] very general reliability of old stuff question

From: D Herring (dherring at tentpost.com)
Date: Fri May 20 23:08:59 EDT 2011

On 05/20/2011 10:00 AM, Neil Van Dyke wrote:
> If someone came to you and said, "We're using PLT 4.2.5 with CGC and
> JIT, and we are wondering whether reliability would be improved by
> moving to Racket 5.x and/or moving to 3m and/or disabling 4.2.5's
> JIT," what would you say?
>
> Details... A big installation of PLT 4.2.5 (with CGC, and with JIT
> enabled) has noticed a rare unexplained crash of the app. This is less
> than 100.0000% reliability, which bothers us more than it would most
> organizations. The app does still use old-style CGC C extensions to
> call one C library. The C library itself is widely used in industry,
> and it is not suspect. It's possible that the C extensions are doing
> something wrong, although they have seemed solid for high volume for
> years, and (though I did not write them myself) they seem to me to be
> doing the right things for GC safety. It's also possible that the
> Scheme or C code of the app is not handling all the conditions of the
> library properly, and on rare occasions will then use the library
> in an invalid way, such as with a bad pointer or causing a vomit on
> the heap or stack. This has occurred on multiple boring Linux servers,
> so hardware is not suspect. We have not ruled out the possibility of a
> freak bug in PLT.
>
> We have set up core dumps and instrumented much of the code for
> detailed logging, and are attempting to stimulate the rare crash in a test
> environment. We have also started some new rigorous analysis of the
> bits of C code. But we're also wondering whether there are known
> instability problems with the older PLT stuff we're using, and if we'd
> be better off, *stability-wise*, moving to Racket 5.x, moving to 3m
> (which probably means using FFI for our library, or replacing it with
> pure Scheme Racket code), or disabling the 4.2.5 JIT.

Can't address your direct question, but here are some questions I've 
had luck with.
- are there multiple threads?  multiple processes?
- possible overflow somewhere?
- holding pointers after free?
- have you tried running under a tool like Valgrind or TotalView's 
memory debugger?
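
For the overflow and stale-pointer questions, here is a minimal sketch in
C of one cheap check that can stay enabled where a memory debugger is too
slow to leave on. It pads each allocation with canary bytes, reports
whether they were overwritten, and nulls the pointer on free. All names
(guarded_alloc, guard_intact, guarded_free, CANARY, GUARD_BYTES) are
hypothetical and are not part of PLT, Racket, or any library. It only
catches writes that land in the padding, so treat it as a complement to
a tool like Valgrind, not a replacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers for illustration only. */
#define CANARY 0xA5u      /* byte pattern written after the usable region */
#define GUARD_BYTES 8     /* number of canary bytes appended */

/* Allocate n usable bytes followed by GUARD_BYTES canary bytes.
   Returns NULL if malloc fails. */
unsigned char *guarded_alloc(size_t n) {
    unsigned char *p = malloc(n + GUARD_BYTES);
    if (p != NULL)
        memset(p + n, CANARY, GUARD_BYTES);
    return p;
}

/* Return 1 if the canary after the n usable bytes is intact,
   0 if any canary byte has been overwritten. */
int guard_intact(const unsigned char *p, size_t n) {
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (p[n + i] != CANARY)
            return 0;
    return 1;
}

/* Free the buffer and null the caller's pointer, so a later use
   dereferences NULL instead of silently touching freed memory. */
void guarded_free(unsigned char **pp) {
    assert(pp != NULL);
    free(*pp);
    *pp = NULL;
}
```

Calling guard_intact(buf, n) just before guarded_free(&buf) and logging
a failure would at least narrow down which buffer was overrun.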

Later,
Daniel



Posted on the users mailing list.