[racket] Case study: using the profiler to trace a performance issue in compiler/zo-parse
Here's a small running "internal dialogue" to demonstrate how the use
of a profiler can help detect unexpected performance problems.
---
I was trying to trace a compilation performance issue in Whalesong.
One of the things I've added to help me chase this is an internal
debugging flag "--enable-profiling", that automatically runs Racket's
statistical profiler as it's doing its work.
So I run Whalesong on one of my programs and watch what comes out of
the profiler. From earlier discussion, I run Racket without the JIT
so that the profiler presents more useful information.
######
$ racket --no-jit ../../whalesong.rkt build --enable-profiling nucleic2.rkt
######
... lots and lots of profiler output starts printing to screen.
(It would be _very_ useful if the text renderer of the profiler were
sorted in terms of self time, by the way! Can this be changed?)
The profiler generates a sequence of "node" outputs, where each node
represents a function call and its context. I start scanning down the
"self" time column, which is how much time is being spent doing work
in a particular function.
One of the functions in the profile jumps out at me:
#######################################################################
Profiling results
-----------------
Total cpu time observed: 9144ms (out of 9268ms)
Number of samples taken: 3595 (once every 3ms)
...
[106] 1822(20.0%) 1822(20.0%) for-loop
/home/dyoo/local/racket-5.2.1/lib/racket/collects/compiler/zo-parse.rkt:490:2
#######################################################################
Oh! About twenty percent of the time is being spent here, in parsing
zo files! Whoa. Why? Let me take a look at that snippet of code.
Here's is the surrounding context:
;; in zo-parse.rkt ;;;;;;;;;;;;;;;;;
(define cpt-table
;; The "schcpt.h" mapping
`([0 escape]
[1 symbol]
;; ...
;; dyoo: omitting some output
;; ...
[247 255 small-application]))
(define (cpt-table-lookup i)
(for/or ([ent (in-list cpt-table)])
(match ent
[(list k sym)
(and (= k i) (cons k sym))]
[(list k k* sym)
(and (<= k i)
(< i k*)
(cons k sym))])))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
cpt-table-lookup defines the for-loop that the profiler reports as
being a hot-spot. Oh, wait! The function looks odd. What
cpt-table-lookup is doing is a lot of work, and I expected it to be a
simple, constant table lookup. (In a sense, it is "constant", but the
constant is larger than it should be.)
Since cpt-table is itself a constant value, that leaves
cpt-table-lookup ripe for an optimization: I can precompute a flat
lookup table. I'll fix cpt-table-lookup by creating a vector,
up-front, that flattens out the cpt-table. Here's what that looks
like:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(define cpt-table-as-vector (make-vector 256))
(for ([ent (in-list cpt-table)])
(match ent
[(list k sym)
(vector-set! cpt-table-as-vector k (cons k sym))]
[(list k k* sym)
(for ([i (in-range k k*)])
(vector-set! cpt-table-as-vector i (cons k sym)))]))
(define (cpt-table-lookup i)
(vector-ref cpt-table-as-vector i))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Ok, let's see what the profiler tells me now:
############################################
Profiling results
-----------------
Total cpu time observed: 7052ms (out of 7148ms)
Number of samples taken: 2812 (once every 3ms)
############################################
Sweet. That helped cut it down from 9 seconds to 7. The node
associated to the cpt-table-lookup function doesn't show up in the
profile anymore, either. About a 20% performance improvement. Cool!
The effect of this optimization isn't so pronounced when Racket is
running under its JIT. Under the JIT, my program took 4.3 seconds
before the optimization, and 3.9 seconds after the optimization,
giving me a 9% improvement. Still, not too bad.
I'll send a patch to compiler/zo-parse later. But I wanted to share
the importance of the profiler: I would not have easily pinpointed
zo-parse without it.