[racket-dev] can we write these four lines of C in performant racket?

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Mon Jul 25 07:51:59 EDT 2011

At Sat, 23 Jul 2011 14:42:15 -0400, John Clements wrote:
> This C code adds the content of one buffer to another one, with no checking.  
> The corresponding racket code runs about 10x slower. Do you folks think that it 
> should be possible to do better? (One salient fact: these are 
> shorts--16-bit-ints--not 32-bit ints.)

My sources say no --- at least not a lot better with the current JIT.

> void addOn(short *dst, int dstOffset, short *src, int srcOffset, int len) {
>   int i;
>   for (i = 0; i<len; i++) {
>     dst[i+dstOffset] += src[i+srcOffset];
>   }}

One problem is that a `s16-vector' in Racket is a struct that wraps an
FFI wrapper around the array pointer, instead of just an array pointer.
So, a C version of the Racket loop is more like

 void addOn(short_ptr *dst, int dstOffset, 
            short_ptr *src, int srcOffset, int len) {
   int i;
   for (i = 0; i<len; i++) {
     dst->a->a[i+dstOffset] += src->a->a[i+srcOffset];
   }
 }

Another problem is that the JIT doesn't use registers well enough, so
its code is closer in this case to unoptimized gcc output. Finally,
there's some extra overhead for the fixnum encoding.

Here are some timings for 1000 iterations on 2^20-element inputs
(32-bit mode, Mac Book Pro 2.53 GHz):

  C as above, gcc -02          :  1409
  C with indirections, gcc -O2 :  4041
  C as above, gcc -O0          :  6425
  C with indirections, gcc -O0 :  8480
  Racket fxvector (more direct):  8883
  Racket                       : 11248

I can tweak the JIT in small ways to make a small improvement:

  Racket with JIT tweaks       : 10670


The Racket code I tried is below (ugly to run as fast as I could make
it).

----------------------------------------

#lang racket/base
(require racket/unsafe/ops
         ffi/vector)

(define SIZE 1048576)

(define addOn #f) ; defeat inlining of `addOn'
(set! addOn
      (lambda (dst dstOffset src srcOffset len)
        (let loop ([i 0])
          (unless (eq? i len)
            (let ([d (unsafe-fx+ dstOffset i)])
              (unsafe-s16vector-set! 
               dst 
               d
               (unsafe-fx+
                (unsafe-s16vector-ref dst d)
                (unsafe-s16vector-ref src (unsafe-fx+ srcOffset i)))))
            (loop (unsafe-fx+ i 1))))))

(let ([a (make-s16vector SIZE)]
      [b (make-s16vector SIZE)])
  (time
   (for ([i (in-range 1000)])
     (addOn a 0 b 0 SIZE))))



Posted on the dev mailing list.