[racket-dev] TR optimizations and Racket overhead [was Re: [plt] Push #28633: master branch updated]
On 04/28/2014 04:46 AM, Sam Tobin-Hochstadt wrote:
>
> On Apr 28, 2014 12:16 AM, "Neil Toronto" <neil.toronto at gmail.com
> <mailto:neil.toronto at gmail.com>> wrote:
> >
> > For anyone wondering what this stuff means: most mixed exact/inexact
> math in Racket should be up to 5x or so faster now, depending on how
> conversion-heavy it is and how large the numbers are. On my machine,
> `real->double-flonum` is 1x or 25x faster on exact rationals (1x if the
> numerator and denominator both fit in a flonum; we optimized the slow
> path), and `inexact->exact` is 4x-100x faster (depending on the
> magnitude of the flonum's exponent). Also, conversions from exact to
> inexact now always find the closest approximation, so e.g. `sqrt` is a
> little more accurate when given exact rationals.
>
> Sounds great.
>
> > FWIW, the C implementations aren't much faster than the Typed Racket
> prototypes. I find that totally awesome.
>
> Impressive! Can you share the prototype?
Attached.
> Also, what ends up slower? What would we have to fix to make them the
> same speed?
After looking into it more, I've found that TR's optimizer isn't helping
much, probably because big integer operations account for most of the
cost. Its largest effect seems to be reducing constant overhead in the
timing loops - which at least makes my analysis more accurate. :)
IOW, I think the prototypes are fast mostly because Racket is fast.
There's probably not a lot more you can do to speed them up.
More details, as measured on my machine:
For rational-to-flonum conversions, the TR implementation is about 13%
slower than C. It uses a different method to determine whether it can
use the fast algorithm (i.e. divide two flonums), which requires an
extra conversion back to big integers, and this apparently accounts for
about 3% of the extra cost. Otherwise, it uses exactly the same
algorithm - it even mutates variables like in the C code.
It's harder to analyze flonum-to-rational conversions, because the TR
implementation has to use `real->floating-point-bytes` and
`integer-bytes->integer`, but the C code can get the flonum's bits using
a cast. When I fix the input to get rid of this variable, the TR
implementation is 2% slower on numbers with negative exponents, and 10%
*faster* on numbers with positive exponents. I suspect that when it's
faster, it's because the C code is checking more cases, because the test
compares `inexact->exact` with `flonum->rational`.
In all, in these two programs, I think Racket overhead accounts for up
to 10% of running time for TR-optimized code.
****
I think you might be interested to see how TR's optimizer does with code
that does a ridiculous amount of non-homogenous floating point. The
robust geometric primitives I'm working on are a perfect test case: I
use only one data structure besides short flvectors and flonums (a
plane, which contains one of each), and most computations are dense
sequences of flonum arithmetic that doesn't just map one operation over
a vector.
1 million applications on typical domain, in ms
Function #:no-optimize opt'd frac
-------------------------------------------------
flv3mag^2 153 97 63%
flv3mag 273 192 70%
flv3dot 185 112 61%
flv3cross 386 273 71%
flv3normalize 830 571 69%
flv3dist^2 190 132 69%
flv3dist 311 220 71%
flv3polygon-perp 1542 1136 74%
flv3polygon-normal 2370 1710 72%
flv3polygon-centroid 805 550 68%
flplane3-point-dist 334 236 71%
flplane3-line-isect-time 1102 955 87%
flplane3-line-isect (seg) 747 533 71%
flplane3-line-isect (no-seg) 1502 1079 72%
So I can get about 9.8 million robust dot products per second out of TR,
and 5.4 million out of Racket.
These functions return results with <= 0.5ulp error by using
double-double representation (i.e. using two flonums per number in
intermediate results). This makes them about 5x slower than
straightforward implementations. I could probably get 45 million dirty
dot products per second out of TR and 27 million out of Racket.
From these two cases (flonum/rational conversion and geometric
primitives) and other math-y things I've done in TR and Racket, I
conclude that
1. When dealing with platform data types like flonums and small
integers, it's not too hard to write TR code that's close to C in
performance.
2. When dealing with non-platform data types like big integers and
exact rationals, even unoptimized Racket is close to C in
performance.
I can't help but wonder how many of Racket's numeric primitives we could
move into TR, and what it would take to do so.
Neil ⊥
-------------- next part --------------
#lang typed/racket
(require math/flonum)
(define flonum-implicit-one (arithmetic-shift 1 52))
(define flonum-subnormal-exp (arithmetic-shift 1 1074))
(define foo (let ([x (random 51239)])
(- x x)))
(: flonum->rational (-> Flonum Exact-Rational))
(define (flonum->rational x)
;; bits 63, 52-62, and 0-51
(define-values (s e m) (flonum->fields x))
(define r
(cond [(= e 2047)
;; Special numbers aren't rational
(raise-type-error 'flonum->rational "rational Flonum" x)]
[(zero? e)
;; Subnormal numbers all have the same exponent, 2^-1074 = +min.0
(/ m flonum-subnormal-exp)]
[else
;; Adjust exponent for bias and mantissa size; add implicit one bit to mantissa
(let ([e (- e (+ 1023 52))]
[m (bitwise-ior m flonum-implicit-one)])
;; Compute m * 2^e
(if (< e 0)
(/ m (arithmetic-shift 1 (- e)))
(arithmetic-shift m e)))]))
(if (zero? s) r (- r)))
;; ===================================================================================================
;; Tests
(require math/base)
(define +one (flonum->ordinal 1.0))
(define -max (flonum->ordinal -max.0))
(define +inf (flonum->ordinal +inf.0))
(define (random-flonum)
(ordinal->flonum (random-integer -max +inf)))
(: xs (Listof Flonum))
(define xs (list* -max.0 -1.0 -min.0 -0.0 0.0 +min.0 1.0 +max.0
(build-list 1000000 (λ (_) (random-flonum)))))
(define: bx : Any #f)
(printf "Randomized testing of flonum->rational~n")
(time
(for ([x (in-list xs)])
(unless (= (flonum->rational x) (inexact->exact x))
(error 'bad "bad: ~v" x))))
(newline)
(printf "Speed test: inexact->exact~n")
(for ([_ (in-range 5)])
(time (for ([x (in-list xs)])
(set! bx (inexact->exact x)))))
(newline)
(printf "Speed test: flonum->rational~n")
(for ([_ (in-range 5)])
(time (for ([x (in-list xs)])
(set! bx (flonum->rational x)))))
(newline)
-------------- next part --------------
#lang typed/racket
(require racket/flonum)
(: make-flonum (-> Integer Integer Flonum))
(define (make-flonum e m)
(fl* (->fl m) (flexpt 2.0 (->fl e))))
(: rounded-div (-> Natural Natural Natural))
;; (rounded-div n d) is equivalent to (round (/ n d)), but a lot faster
(define (rounded-div n d)
(define-values (i f) (quotient/remainder n d))
(define d/2 (arithmetic-shift d -1))
(cond [(< f d/2) i]
[(> f d/2) (+ i 1)]
[(or (odd? d) (even? i)) i]
[else (+ i 1)]))
(define FLOAT_E_MIN -1074)
(define FLOAT_M_BITS 52)
(: rational->flonum (-> Exact-Rational Flonum))
(define (rational->flonum q)
(cond
[(negative? q) (- (rational->flonum (- q)))]
[else
(define n (numerator q))
(define d (denominator q))
(define fn (->fl n))
(define fd (->fl d))
(cond
[(and (< (abs fn) +inf.0) (= n (fl->exact-integer fn))
(< (abs fd) +inf.0) (= d (fl->exact-integer fd)))
;; Numerator and denominator are exactly represented by flonums, so correctly rounded
;; division implies this result has no more than 0.5ulp error
(/ fn fd)]
[else
;; The following code was shamelessly stolen from racket/src/racket/src/rational.c
(define nl (integer-length n))
(define dl (integer-length d))
(define p (- nl dl))
(cond [(< p 0) (set! n (arithmetic-shift n (- p)))]
[else (set! d (arithmetic-shift d p))])
(when (< n d)
(set! n (arithmetic-shift n 1))
(set! p (- p 1)))
(define shift (- p FLOAT_E_MIN))
(when (> shift FLOAT_M_BITS)
(set! shift FLOAT_M_BITS))
(set! n (arithmetic-shift n shift))
;; But this part is new:
(set! n (rounded-div (abs n) (abs d)))
(make-flonum (- p shift) n)])]))
;; ===================================================================================================
;; Tests
(require typed/rackunit
math/base
math/flonum)
(printf "Testing rounded-div~n")
(for ([n (in-range 0 16)]
[d (in-range 1 16)])
(check-= (round (/ n d)) (rounded-div (abs n) (abs d)) 0))
(printf "Testing single cases~n")
;; Sanity check
(check-= (rational->flonum 1/7) 0.14285714285714285 0)
(check-= (rational->flonum 9/7) 1.2857142857142858 0)
;; Cases that real->double-flonum gets wrong
(check-= (rational->flonum -13737024017780747/2813507303900) -4882.526517254422 0)
(check-= (rational->flonum -1656/16910305547451097) -9.792844933246106e-14 0)
(newline)
(: random-rational (-> Exact-Rational))
(define (random-rational)
(define d (random-bits (+ 1 (random 8192))))
(cond [(zero? d) (random-rational)]
[else
(* (if (< (random) 0.5) -1 1)
(/ (random-bits (+ 1 (random 8192))) d))]))
(printf "Randomized testing of rational->flonum~n")
(define es
(filter
(λ (v) v)
(for/list : (Listof Any) ([_ (in-range 10000)])
(define ry (random-rational))
(define y (rational->flonum ry))
(define e (flulp-error y ry))
(if (e . > . 0.5) (list e ry) #f))))
(check-equal? es empty)
(newline)
(printf "Speed test: 100000 conversions of large rationals~n")
(define ns (build-list 10000 (λ (_) (random-bits 8192))))
(define ds (build-list 10000 (λ (_) (random-bits 8192))))
(define qs (map / ns ds))
(define bx (flvector 0.0))
(printf "real->double-flonum~n")
(for ([_ (in-range 5)])
(time (for* ([_ (in-range 10)]
[q (in-list qs)])
(flvector-set! bx 0 (real->double-flonum q)))))
(newline)
(printf "rational->flonum~n")
(for ([_ (in-range 5)])
(time (for* ([_ (in-range 10)]
[q (in-list qs)])
(flvector-set! bx 0 (rational->flonum q)))))
(newline)