[racket-dev] TR optimizations and Racket overhead [was Re: [plt] Push #28633: master branch updated]

From: Neil Toronto (neil.toronto at gmail.com)
Date: Mon Apr 28 16:38:01 EDT 2014

On 04/28/2014 04:46 AM, Sam Tobin-Hochstadt wrote:
> On Apr 28, 2014 12:16 AM, "Neil Toronto" <neil.toronto at gmail.com
> <mailto:neil.toronto at gmail.com>> wrote:
>  >
>  > For anyone wondering what this stuff means: most mixed exact/inexact
> math in Racket should be up to 5x or so faster now, depending on how
> conversion-heavy it is and how large the numbers are. On my machine,
> `real->double-flonum` is 1x or 25x faster on exact rationals (1x if the
> numerator and denominator both fit in a flonum; we optimized the slow
> path), and `inexact->exact` is 4x-100x faster (depending on the
> magnitude of the flonum's exponent). Also, conversions from exact to
> inexact now always find the closest approximation, so e.g. `sqrt` is a
> little more accurate when given exact rationals.
> Sounds great.
>  > FWIW, the C implementations aren't much faster than the Typed Racket
> prototypes. I find that totally awesome.
> Impressive! Can you share the prototype?


> Also, what ends up slower? What would we have to fix to make them the
> same speed?

After looking into it more, I've found that TR's optimizer isn't helping 
much, probably because big integer operations account for most of the 
cost. Its largest effect seems to be reducing constant overhead in the 
timing loops - which at least makes my analysis more accurate. :)

IOW, I think the prototypes are fast mostly because Racket is fast. 
There's probably not a lot more you can do to speed them up.

More details, as measured on my machine:

For rational-to-flonum conversions, the TR implementation is about 13% 
slower than C. It uses a different method to determine whether it can 
use the fast algorithm (i.e. divide two flonums), which requires an 
extra conversion back to big integers, and this apparently accounts for 
about 3% of the extra cost. Otherwise, it uses exactly the same 
algorithm - it even mutates variables like in the C code.

It's harder to analyze flonum-to-rational conversions, because the TR 
implementation has to use `real->floating-point-bytes` and 
`integer-bytes->integer`, but the C code can get the flonum's bits using 
a cast. When I fix the input to get rid of this variable, the TR 
implementation is 2% slower on numbers with negative exponents, and 10% 
*faster* on numbers with positive exponents. I suspect that when it's 
faster, it's because the C code is checking more cases, because the test 
compares `inexact->exact` with `flonum->rational`.

In all, in these two programs, I think Racket overhead accounts for up 
to 10% of running time for TR-optimized code.


I think you might be interested to see how TR's optimizer does with code 
that does a ridiculous amount of non-homogenous floating point. The 
robust geometric primitives I'm working on are a perfect test case: I 
use only one data structure besides short flvectors and flonums (a 
plane, which contains one of each), and most computations are dense 
sequences of flonum arithmetic that doesn't just map one operation over 
a vector.

    1 million applications on typical domain, in ms

   Function              #:no-optimize  opt'd   frac
   flv3mag^2                      153     97     63%
   flv3mag                        273    192     70%
   flv3dot                        185    112     61%
   flv3cross                      386    273     71%
   flv3normalize                  830    571     69%
   flv3dist^2                     190    132     69%
   flv3dist                       311    220     71%
   flv3polygon-perp              1542   1136     74%
   flv3polygon-normal            2370   1710     72%
   flv3polygon-centroid           805    550     68%
   flplane3-point-dist            334    236     71%
   flplane3-line-isect-time      1102    955     87%
   flplane3-line-isect (seg)      747    533     71%
   flplane3-line-isect (no-seg)  1502   1079     72%

So I can get about 9.8 million robust dot products per second out of TR, 
and 5.4 million out of Racket.

These functions return results with <= 0.5ulp error by using 
double-double representation (i.e. using two flonums per number in 
intermediate results). This makes them about 5x slower than 
straightforward implementations. I could probably get 45 million dirty 
dot products per second out of TR and 27 million out of Racket.

 From these two cases (flonum/rational conversion and geometric 
primitives) and other math-y things I've done in TR and Racket, I 
conclude that

  1. When dealing with platform data types like flonums and small
     integers, it's not too hard to write TR code that's close to C in

  2. When dealing with non-platform data types like big integers and
     exact rationals, even unoptimized Racket is close to C in

I can't help but wonder how many of Racket's numeric primitives we could 
move into TR, and what it would take to do so.

Neil ⊥

-------------- next part --------------
#lang typed/racket

(require math/flonum)

(define flonum-implicit-one (arithmetic-shift 1 52))
(define flonum-subnormal-exp (arithmetic-shift 1 1074))

(define foo (let ([x  (random 51239)])
              (- x x)))

(: flonum->rational (-> Flonum Exact-Rational))
(define (flonum->rational x)
  ;; bits 63, 52-62, and 0-51
  (define-values (s e m) (flonum->fields x))
  (define r
    (cond [(= e 2047)
           ;; Special numbers aren't rational
           (raise-type-error 'flonum->rational "rational Flonum" x)]
          [(zero? e)
           ;; Subnormal numbers all have the same exponent, 2^-1074 = +min.0
           (/ m flonum-subnormal-exp)]
           ;; Adjust exponent for bias and mantissa size; add implicit one bit to mantissa
           (let ([e  (- e (+ 1023 52))]
                 [m  (bitwise-ior m flonum-implicit-one)])
             ;; Compute m * 2^e
             (if (< e 0)
                 (/ m (arithmetic-shift 1 (- e)))
                 (arithmetic-shift m e)))]))
  (if (zero? s) r (- r)))

;; ===================================================================================================
;; Tests

(require math/base)

(define +one (flonum->ordinal 1.0))
(define -max (flonum->ordinal -max.0))
(define +inf (flonum->ordinal +inf.0))
(define (random-flonum)
  (ordinal->flonum (random-integer -max +inf)))

(: xs (Listof Flonum))
(define xs (list* -max.0 -1.0 -min.0 -0.0 0.0 +min.0 1.0 +max.0
                  (build-list 1000000 (λ (_) (random-flonum)))))
(define: bx : Any  #f)

(printf "Randomized testing of flonum->rational~n")
 (for ([x  (in-list xs)])
   (unless (= (flonum->rational x) (inexact->exact x))
     (error 'bad "bad: ~v" x))))

(printf "Speed test: inexact->exact~n")
(for ([_  (in-range 5)])
  (time (for ([x  (in-list xs)])
          (set! bx (inexact->exact x)))))

(printf "Speed test: flonum->rational~n")
(for ([_  (in-range 5)])
  (time (for ([x  (in-list xs)])
          (set! bx (flonum->rational x)))))
-------------- next part --------------
#lang typed/racket

(require racket/flonum)

(: make-flonum (-> Integer Integer Flonum))
(define (make-flonum e m)
  (fl* (->fl m) (flexpt 2.0 (->fl e))))

(: rounded-div (-> Natural Natural Natural))
;; (rounded-div n d) is equivalent to (round (/ n d)), but a lot faster
(define (rounded-div n d)
  (define-values (i f) (quotient/remainder n d))
  (define d/2 (arithmetic-shift d -1))
  (cond [(< f d/2)  i]
        [(> f d/2)  (+ i 1)]
        [(or (odd? d) (even? i))  i]
        [else  (+ i 1)]))

(define FLOAT_E_MIN -1074)
(define FLOAT_M_BITS 52)

(: rational->flonum (-> Exact-Rational Flonum))
(define (rational->flonum q)
    [(negative? q)  (- (rational->flonum (- q)))]
     (define n (numerator q))
     (define d (denominator q))
     (define fn (->fl n))
     (define fd (->fl d))
       [(and (< (abs fn) +inf.0) (= n (fl->exact-integer fn))
             (< (abs fd) +inf.0) (= d (fl->exact-integer fd)))
        ;; Numerator and denominator are exactly represented by flonums, so correctly rounded
        ;; division implies this result has no more than 0.5ulp error
        (/ fn fd)]
        ;; The following code was shamelessly stolen from racket/src/racket/src/rational.c
        (define nl (integer-length n))
        (define dl (integer-length d))
        (define p (- nl dl))
        (cond [(< p 0)  (set! n (arithmetic-shift n (- p)))]
              [else     (set! d (arithmetic-shift d p))])
        (when (< n d)
          (set! n (arithmetic-shift n 1))
          (set! p (- p 1)))
        (define shift (- p FLOAT_E_MIN))
        (when (> shift FLOAT_M_BITS)
          (set! shift FLOAT_M_BITS))
        (set! n (arithmetic-shift n shift))
        ;; But this part is new:
        (set! n (rounded-div (abs n) (abs d)))
        (make-flonum (- p shift) n)])]))

;; ===================================================================================================
;; Tests

(require typed/rackunit

(printf "Testing rounded-div~n")
(for ([n  (in-range 0 16)]
      [d  (in-range 1 16)])
  (check-= (round (/ n d)) (rounded-div (abs n) (abs d)) 0))

(printf "Testing single cases~n")
;; Sanity check
(check-= (rational->flonum 1/7) 0.14285714285714285 0)
(check-= (rational->flonum 9/7) 1.2857142857142858 0)
;; Cases that real->double-flonum gets wrong
(check-= (rational->flonum -13737024017780747/2813507303900) -4882.526517254422 0)
(check-= (rational->flonum -1656/16910305547451097) -9.792844933246106e-14 0)

(: random-rational (-> Exact-Rational))
(define (random-rational)
  (define d (random-bits (+ 1 (random 8192))))
  (cond [(zero? d)  (random-rational)]
         (* (if (< (random) 0.5) -1 1)
            (/ (random-bits (+ 1 (random 8192))) d))]))

(printf "Randomized testing of rational->flonum~n")
(define es
   (λ (v) v)
   (for/list : (Listof Any) ([_  (in-range 10000)])
     (define ry (random-rational))
     (define y (rational->flonum ry))
     (define e (flulp-error y ry))
     (if (e . > . 0.5) (list e ry) #f))))
(check-equal? es empty)

(printf "Speed test: 100000 conversions of large rationals~n")
(define ns (build-list 10000 (λ (_) (random-bits 8192))))
(define ds (build-list 10000 (λ (_) (random-bits 8192))))
(define qs (map / ns ds))

(define bx (flvector 0.0))

(printf "real->double-flonum~n")
(for ([_  (in-range 5)])
  (time (for* ([_  (in-range 10)]
               [q  (in-list qs)])
          (flvector-set! bx 0 (real->double-flonum q)))))

(printf "rational->flonum~n")
(for ([_  (in-range 5)])
  (time (for* ([_  (in-range 10)]
               [q  (in-list qs)])
          (flvector-set! bx 0 (rational->flonum q)))))

Posted on the dev mailing list.