[racket] Case study: using the profiler to trace a performance issue in compiler/zo-parse

From: Robby Findler (robby at eecs.northwestern.edu)
Date: Sat Mar 17 20:46:28 EDT 2012

I've had similarly happy experiences with the profiler when working on
Redex's pattern matcher. It has been great. (+1 on the sorting thing.)

Thanks, Eli.

Robby

On Sat, Mar 17, 2012 at 7:26 PM, Danny Yoo <dyoo at cs.wpi.edu> wrote:
> Here's a small running "internal dialogue" to demonstrate how the use
> of a profiler can help detect unexpected performance problems.
>
> ---
>
> I was trying to trace a compilation performance issue in Whalesong.
> One of the things I've added to help me chase this is an internal
> debugging flag "--enable-profiling", that automatically runs Racket's
> statistical profiler as it's doing its work.
>
>
> So I run Whalesong on one of my programs and watch what comes out of
> the profiler.  From earlier discussion, I run Racket without the JIT
> so that the profiler presents more useful information.
>
> ######
> $ racket --no-jit ../../whalesong.rkt build --enable-profiling nucleic2.rkt
> ######
>
> ... lots and lots of profiler output starts printing to screen.
>
> (It would be _very_ useful if the text renderer of the profiler were
> sorted in terms of self time, by the way!  Can this be changed?)
>
>
> The profiler generates a sequence of "node" outputs, where each node
> represents a function call and its context.  I start scanning down the
> "self" time column, which is how much time is being spent doing work
> in a particular function.
>
> One of the functions in the profile jumps out at me:
>
> #######################################################################
> Profiling results
> -----------------
>  Total cpu time observed: 9144ms (out of 9268ms)
>  Number of samples taken: 3595 (once every 3ms)
> ...
> [106] 1822(20.0%) 1822(20.0%) for-loop
> /home/dyoo/local/racket-5.2.1/lib/racket/collects/compiler/zo-parse.rkt:490:2
> #######################################################################
>
>
> Oh!  About twenty percent of the time is being spent here, in parsing
> zo files!  Whoa.  Why?  Let me take a look at that snippet of code.
> Here's is the surrounding context:
>
> ;; in zo-parse.rkt ;;;;;;;;;;;;;;;;;
> (define cpt-table
>  ;; The "schcpt.h" mapping
>  `([0  escape]
>    [1  symbol]
>    ;; ...
>    ;; dyoo: omitting some output
>    ;; ...
>    [247 255 small-application]))
>
> (define (cpt-table-lookup i)
>  (for/or ([ent (in-list cpt-table)])
>    (match ent
>      [(list k sym)
>       (and (= k i) (cons k sym))]
>      [(list k k* sym)
>       (and (<= k i)
>            (< i k*)
>            (cons k sym))])))
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> cpt-table-lookup defines the for-loop that the profiler reports as
> being a hot-spot.  Oh, wait!  The function looks odd.  What
> cpt-table-lookup is doing is a lot of work, and I expected it to be a
> simple, constant table lookup.  (In a sense, it is "constant", but the
> constant is larger than it should be.)
>
>
> Since cpt-table is itself a constant value, that leaves
> cpt-table-lookup ripe for an optimization: I can precompute a flat
> lookup table.  I'll fix cpt-table-lookup by creating a vector,
> up-front, that flattens out the cpt-table.  Here's what that looks
> like:
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
> (define cpt-table-as-vector (make-vector 256))
> (for ([ent (in-list cpt-table)])
>  (match ent
>    [(list k sym)
>     (vector-set! cpt-table-as-vector k (cons k sym))]
>    [(list k k* sym)
>     (for ([i (in-range k k*)])
>       (vector-set! cpt-table-as-vector i (cons k sym)))]))
>
> (define (cpt-table-lookup i)
>  (vector-ref cpt-table-as-vector i))
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
>
> Ok, let's see what the profiler tells me now:
>
> ############################################
>
> Profiling results
> -----------------
>  Total cpu time observed: 7052ms (out of 7148ms)
>  Number of samples taken: 2812 (once every 3ms)
> ############################################
>
> Sweet.  That helped cut it down from 9 seconds to 7.  The node
> associated to the cpt-table-lookup function doesn't show up in the
> profile anymore, either.  About a 20% performance improvement.  Cool!
>
>
> The effect of this optimization isn't so pronounced when Racket is
> running under its JIT.  Under the JIT, my program took 4.3 seconds
> before the optimization, and 3.9 seconds after the optimization,
> giving me a 9% improvement.  Still, not too bad.
>
>
> I'll send a patch to compiler/zo-parse later.  But I wanted to share
> the importance of the profiler: I would not have easily pinpointed
> zo-parse without it.
> ____________________
>  Racket Users list:
>  http://lists.racket-lang.org/users


Posted on the users mailing list.