[plt-scheme] Why is the JIT Written in C?

From: Will M. Farr (wmfarr at gmail.com)
Date: Tue Dec 1 13:46:32 EST 2009

Matthew (and others),

On Dec 1, 2009, at 12:25 PM, Matthew Flatt wrote:

> It's just how we've gotten from point A to point B. Adding a JIT that
> needed itself to run fast wouldn't have worked for us in practice,
> because it took (and still is taking) a while for the JIT to generate
> code that runs as fast as a JIT needs to run. That is, the overhead of
> a slow JIT would have prevented us from using the JIT to make
> incremental progress, and so on.

That makes a lot of sense---thanks for the explanation.  

> There's very little of our C code that really should be in C, though;
> your point is well taken. Unfortunately, it's unlikely that the JIT
> will migrate out of C soon, though, because we're currently more
> interested in other things, such as rebuilding the GUI (in Scheme
> instead of C++!).

Can I add that I think this is *extremely* cool?  (And, in the scheme of things, much more useful to the vast majority of users than re-writing the working JIT in Scheme.) 

> For example, the convolution code below runs within a factor of 4 of
> optimized C (compared to a factor of 15 for Scheme code using safe
> operations). In the JIT-generated code, the floating-point numbers are
> not boxed in the path from the input arrays to the output array.
> 
> There's still plenty of room for improvement in that factor of 4. About
> half of the difference is the representation of f64vectors (a pointer
> to a pointer to the data), and half of it the quality of the machine
> code generated by the JIT.
> 
> Having the JIT in a form where you could help imporve it would be
> ideal. Meanwhile, are there more things that a MzScheme JIT export
> could add to move toward things like `numeric-mode'?

I had not realized that there were unsafe-f64vector-ref and unsafe-f64vector-set!, or that the JIT would unbox compositions of the various unsafe operators.  Given this (which is already impressive), I would say that the two remaining steps to reach "OCaml-like" speed are 

* unboxing of floating-point values that are stored in purely local variables and used in only float expressions
* better code generation.

I'm guessing that the first is much easier than the second :).  For example, instead of the expression 

>              (vs! result
>                   i
>                   (fl+ (vr result i)
>                        (fl* (vr kernel j)
>                             (vr signal k))))

it would be nice to have 

(let ((result (vr result i))
      (kernel (vr kernel j))
      (signal (vr signal k)))
  (vs! result i (fl+ result (fl* kernel signal))))

with all the unboxing that appears in the uglier version (in addition, if you use the kernel and signal elements of the vector again, it's possible to reduce the f64vector "vr" overhead if these are stored unboxed in registers or on the stack).  Perhaps the JIT already does this, and I'm just ignorant?  

In general, would it be possible to guarantee the same unboxing rules as in OCaml (see http://caml.inria.fr/pub/old_caml_site/ocaml/numerical.html , in the section about boxing)?  This would probably make PLT Scheme "about" as fast as OCaml for numerical kernels, and OCaml is "about" as fast as C.  It would certainly become possible to consider writing a first-version of some computational code in Scheme without immediately dropping to C for the floating-point stuff to avoid horrendously long runtimes.  

Let me know if you would like to have me implement some simple (but "real") numerical codes in Scheme for benchmarks that you can use for feedback on useful optimizations of the JIT.  (i.e. a "full" N-body simulation, as opposed to a bunch of micro-benchmarks like FFT, convolution, the shootout N-body, matrix multiply etc.)

Thanks again for the quick response,
Will

Posted on the users mailing list.