[plt-scheme] faster floating-point arithmetic

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sat Dec 19 08:15:31 EST 2009

In v4.2.3.8, the new `scheme/flonum' library provides operations that
are specific to inexact real numbers, which are also known as
"flonums". Flonum-specific arithmetic and comparison operations include
`fl+' and `fl<'. There's also a `flvector' type that is like a vector
but contains only flonums.

Flonum-specific operations essentially act as a hint to the bytecode
and JIT compilers to improve the performance of floating-point
computations. Further performance may be available using unsafe
operations (from `scheme/unsafe/ops'), but safe flonum-specific
operations provide most of the performance benefits.

For example, here's a "convolution" computation that can be
instantiated with generic vectors and arithmetic or with
flonum vectors and arithmetic:

 (define-syntax-rule (conv signal            ; vector argument 1
                           kernel            ; vector argument 2
                           mk                ; make-vector 
                           vl                ; vector-length
                           vr vs!            ; vector-{ref,set!}
                           fl+ fl*           ; ops on vector elements
                           fx+ fx- fx= fx<=) ; ops on vector indices
   (let ([result (mk (fx+ (fx+ (vl signal)
                               (vl kernel))
                          -1))]
         [klen (vl kernel)]
         [slen-1 (fx- (vl signal) 1)])
     (for ([i (in-range 0 (vl result))])
       (let loop ([accum 0.0][j 0])
         (if (fx= j klen)
             (vs! result i accum)
             (let ([k (fx- (fx+ (fx+ i j) 1)
                           klen)])
               (if (and (fx<= 0 k) (fx<= k slen-1))
                   (loop (fl+ accum
                              (fl* (vr kernel j)
                                   (vr signal k)))
                         (fx+ j 1))
                   (loop accum (fx+ j 1)))))))
     result))

Here's the generic instantiation:

 (define (conv/generic signal kernel)
   (conv signal kernel
         make-vector vector-length 
         vector-ref vector-set!
         + *
         + - = <=))

A safe version that uses flonum-specific operations:

 (define (conv/fl signal kernel)
   (conv signal kernel
         make-flvector flvector-length 
         flvector-ref flvector-set!
         fl+ fl*
         + - = <=))

Using unsafe flonum operations:

 (define (conv/unsafe signal kernel)
   (conv signal kernel
         make-flvector unsafe-flvector-length 
         unsafe-flvector-ref unsafe-flvector-set!
         unsafe-fl+ unsafe-fl*
         + - = <=))

Using unsafe fixnum operations, too:

 (define (conv/all-unsafe signal kernel)
   (conv signal kernel
         make-flvector unsafe-flvector-length 
         unsafe-flvector-ref unsafe-flvector-set!
         unsafe-fl+ unsafe-fl*
         unsafe-fx+ unsafe-fx- unsafe-fx= unsafe-fx<=))

Timings (x86, 2.53 GHz):

 conv/generic    cpu time: 2666 real time: 2679 gc time: 1099
 conv/fl         cpu time: 930  real time: 936  gc time: 30
 conv/unsafe     cpu time: 693  real time: 693  gc time: 2
 conv/all-unsafe cpu time: 462  real time: 463  gc time: 1


Some other examples from collects/tests/mzscheme/benchmarks/shootout
(cpu time in msec on x86):

                    generic    flonum    all-unsafe
 mandelbrot 1000      2604        908           792
 spectralnorm 1000    7148       1953          1991
 nbody 1000000        5407       3079             -
 nbody vec 1000000    4491       2121          1112


Most of the performance benefit from `fl' operations is in unboxing
intermediate results (i.e., avoiding the allocation of a wrapper to
represent a floating-point value as a Scheme value). For example, in

      (fl/ (fl* 2.0 (->fl x)) (->fl n))

the compiler can see that the reslt of `fl*' and the second `->fl'
will be used only as arguments to `fl/', so they can be used directly
by `fl/' instead of boxed and unboxed. Of course, the compler also
avoids any run-time check on the arguments to `fl/', since there's no
wrapper to check.

In

 (define (mandelbrot x y n ci)
   (let ((cr (fl- (fl/ (fl* 2.0 (->fl x)) (->fl n)) 1.5)))
     (let loop ((i 0) (zr 0.0) (zi 0.0))
       (if (> i +iterations+)
           1
           (let ([zr^2 (fl* zr zr)]
                 [zi^2 (fl* zi zi)])
           (cond
            ((fl> (fl+ zr^2 zi^2) +limit-sqr+) 0)
            (else (loop (+ 1 i) 
                        (fl+ (fl- zr^2 zi^2) cr) 
                        (fl+ (fl* 2.0 (fl* zr zi)) ci)))))))))

the compiler similarly sees that `zr^2' is bound to a flonum and used
as an argument to flonum-consuming procedures, so the result `(fl* zr
zr)' is put into local storage without an extra Scheme wrapper.

Finally, the compiler can see that `zr' and `zi' always hold flonums:
they are initialized to `0.0', and the self-call to `loop' provides
flonums. Again, an intermediate box can be avoided when jumping back to
the beginning of the loop. (The compiler will unroll the loop a few
times, though, so really it's the `let'-binding rule that applies for
most iterations.)


The current set of flonum-specific operations is not complete. We'll
eventually add trig functions, for example. There's also room for
further improvement in the compiler (as always!), but we have no
immediate plans on that front.


Better compilation of floating-point arithmetic was motivated partly
by work on futures:

  http://list.cs.brown.edu/pipermail/plt-dev/2009-December/001761.html

When using futures to parallelize a numerical computation, the relative
penalty for boxing an intermediate result is even higher than in
sequential code; allocation leads to garbage collection, and garbage
collection stops all futures. With unboxed floating-point arithmetic,
futures can offer more consistent improvement for more numerical
programs.



Posted on the users mailing list.