[plt-scheme] faster floating-point arithmetic
In v4.2.3.8, the new `scheme/flonum' library provides operations that
are specific to inexact real numbers, which are also known as
"flonums". Flonum-specific arithmetic and comparison operations include
`fl+' and `fl<'. There's also a `flvector' type that is like a vector
but contains only flonums.
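For instance (a small illustrative session, not from the original
post):

  (require scheme/flonum)
  (fl+ 1.5 2.25)        ; => 3.75
  (fl< 1.0 2.0)         ; => #t
  (define v (flvector 1.0 2.0 3.0))
  (flvector-ref v 1)    ; => 2.0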
Flonum-specific operations essentially act as a hint to the bytecode
and JIT compilers to improve the performance of floating-point
computations. Further performance gains are available through unsafe
operations (from `scheme/unsafe/ops'), but the safe flonum-specific
operations already provide most of the benefit.
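As a quick illustration (again, not from the original post): an unsafe
variant behaves the same as its safe counterpart on valid arguments,
but it skips the run-time check, so passing a non-flonum can crash the
process:

  (require scheme/unsafe/ops)
  (unsafe-fl+ 1.5 2.25) ; => 3.75, same as (fl+ 1.5 2.25), but with
                        ; no check that the arguments are flonums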
For example, here's a "convolution" computation that can be
instantiated with generic vectors and arithmetic or with
flonum vectors and arithmetic:
(define-syntax-rule (conv signal ; vector argument 1
kernel ; vector argument 2
mk ; make-vector
vl ; vector-length
vr vs! ; vector-{ref,set!}
fl+ fl* ; ops on vector elements
fx+ fx- fx= fx<=) ; ops on vector indices
(let ([result (mk (fx+ (fx+ (vl signal)
(vl kernel))
-1))]
[klen (vl kernel)]
[slen-1 (fx- (vl signal) 1)])
(for ([i (in-range 0 (vl result))])
(let loop ([accum 0.0][j 0])
(if (fx= j klen)
(vs! result i accum)
(let ([k (fx- (fx+ (fx+ i j) 1)
klen)])
(if (and (fx<= 0 k) (fx<= k slen-1))
(loop (fl+ accum
(fl* (vr kernel j)
(vr signal k)))
(fx+ j 1))
(loop accum (fx+ j 1)))))))
result))
Here's the generic instantiation:
(define (conv/generic signal kernel)
(conv signal kernel
make-vector vector-length
vector-ref vector-set!
+ *
+ - = <=))
A safe version that uses flonum-specific operations:
(define (conv/fl signal kernel)
(conv signal kernel
make-flvector flvector-length
flvector-ref flvector-set!
fl+ fl*
+ - = <=))
Using unsafe flonum operations:
(define (conv/unsafe signal kernel)
(conv signal kernel
make-flvector unsafe-flvector-length
unsafe-flvector-ref unsafe-flvector-set!
unsafe-fl+ unsafe-fl*
+ - = <=))
Using unsafe fixnum operations, too:
(define (conv/all-unsafe signal kernel)
(conv signal kernel
make-flvector unsafe-flvector-length
unsafe-flvector-ref unsafe-flvector-set!
unsafe-fl+ unsafe-fl*
unsafe-fx+ unsafe-fx- unsafe-fx= unsafe-fx<=))
Timings (x86, 2.53 GHz):
conv/generic cpu time: 2666 real time: 2679 gc time: 1099
conv/fl cpu time: 930 real time: 936 gc time: 30
conv/unsafe cpu time: 693 real time: 693 gc time: 2
conv/all-unsafe cpu time: 462 real time: 463 gc time: 1
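The timings can be reproduced with `time', roughly as follows; the
vector sizes below are made up for illustration, since the post
doesn't say which inputs were measured:

  (define signal (make-flvector 100000 1.0))
  (define kernel (make-flvector 100 1.0))
  (time (void (conv/fl signal kernel)))
  ; prints: cpu time: ... real time: ... gc time: ...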
Some other examples from collects/tests/mzscheme/benchmarks/shootout
(cpu time in msec on x86):
                     generic   flonum   all-unsafe
  mandelbrot 1000       2604      908          792
  spectralnorm 1000     7148     1953         1991
  nbody 1000000         5407     3079            -
  nbody vec 1000000     4491     2121         1112
Most of the performance benefit from `fl' operations is in unboxing
intermediate results (i.e., avoiding the allocation of a wrapper to
represent a floating-point value as a Scheme value). For example, in
(fl/ (fl* 2.0 (->fl x)) (->fl n))
the compiler can see that the result of `fl*' and the second `->fl'
will be used only as arguments to `fl/', so they can be used directly
by `fl/' instead of being boxed and unboxed. Of course, the compiler
also avoids any run-time check on the arguments to `fl/', since
there's no wrapper to check.
In
(define (mandelbrot x y n ci)
(let ((cr (fl- (fl/ (fl* 2.0 (->fl x)) (->fl n)) 1.5)))
(let loop ((i 0) (zr 0.0) (zi 0.0))
(if (> i +iterations+)
1
(let ([zr^2 (fl* zr zr)]
[zi^2 (fl* zi zi)])
(cond
((fl> (fl+ zr^2 zi^2) +limit-sqr+) 0)
(else (loop (+ 1 i)
(fl+ (fl- zr^2 zi^2) cr)
(fl+ (fl* 2.0 (fl* zr zi)) ci)))))))))
the compiler similarly sees that `zr^2' is bound to a flonum and used
as an argument to flonum-consuming procedures, so the result `(fl* zr
zr)' is put into local storage without an extra Scheme wrapper.
Finally, the compiler can see that `zr' and `zi' always hold flonums:
they are initialized to `0.0', and the self-call to `loop' provides
flonums. Again, an intermediate box can be avoided when jumping back to
the beginning of the loop. (The compiler will unroll the loop a few
times, though, so really it's the `let'-binding rule that applies for
most iterations.)
The current set of flonum-specific operations is not complete. We'll
eventually add trig functions, for example. There's also room for
further improvement in the compiler (as always!), but we have no
immediate plans on that front.
Better compilation of floating-point arithmetic was motivated partly
by work on futures:
http://list.cs.brown.edu/pipermail/plt-dev/2009-December/001761.html
When using futures to parallelize a numerical computation, the relative
penalty for boxing an intermediate result is even higher than in
sequential code; allocation leads to garbage collection, and garbage
collection stops all futures. With unboxed floating-point arithmetic,
futures can offer more consistent improvement for more numerical
programs.