[racket] Using the profiler effectively?

From: Eli Barzilay (eli at barzilay.org)
Date: Sun Sep 4 20:28:23 EDT 2011

Just now, Danny Yoo wrote:
> 
> But I don't know how to interpret the results I'm getting back from
> the profiler yet.  I guess I'm trying to say: I'm illiterate: I
> don't know how to read the textual output of the profiler, and need
> some help.

The output is a conventional one.  (It's not completely compatible
with gprof -- I based it on the output of Allegro Lisp -- but it's
close enough that it wouldn't be hard to adapt it.)


>   Total cpu time observed: 26160ms (out of 26538ms)
>   Number of samples taken: 418 (once every 63ms)
>   (Hiding functions with self<1.0% and local<2.0%: 16 of 369 hidden)

This part should be obvious.  One potential problem is that you have
a very small number of samples -- it would help to just run your code
multiple times, or to sample more frequently, so you get more data.
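For a rough sense of why 418 samples is on the low side, here's a
back-of-the-envelope sketch (plain Python, not Racket): if a function
truly accounts for 6% of the runtime, treating each sample as an
independent draw gives the estimate a standard error of about 1.2
percentage points -- a fifth of the quantity being measured -- while
ten times as many samples shrinks that to about 0.4:

```python
import math

def pct_standard_error(p, n):
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1 - p) / n)

# A function that truly accounts for 6% of the runtime, observed
# with 418 samples (as in the report above):
few = pct_standard_error(0.06, 418)
# The same function observed with 10x as many samples:
many = pct_standard_error(0.06, 4180)

print(round(few * 100, 2))   # ~1.16 percentage points of noise
print(round(many * 100, 2))  # ~0.37 -- a much tighter estimate
```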


> [...]

It prints long lines, which your emailer butchered (and I tried to fix
below).

>                                  map [120]                  16.0%
>                                  loop [126]                 84.0%
>   [2]  1576(6.0%)      0(0.0%) parse-def-values ...parse-bytecode-5.1.2.rkt:273:3
>                                  parse-lam [8]              80.7%
>                                  parse-case-lam [10]         9.6%
>                                  parse-let-void [14]         6.3%
>                                  parse-with-cont-mark [154]  3.3%


So, this entry gives you information about `parse-def-values', which
is defined in that file, at 273:3.  (The "...s" are needed because
without them the lines would turn from inconveniently long to
problematically long; more below.)  Here are some random bits that
should explain the rest:

* It was running for 1576ms, which is 6.0% of the overall runtime.
  (Remember that "it was running" is an estimate; what was actually
  measured is "an active stack frame for this call was observed N
  times, which sums to that time".)

* It spent 0ms in its own code, which is 0.0% of its running time.  In
  this case, 0 means that this function was never at the top of the
  stack, which means that it's very unlikely to be a bottleneck.  To
  cut runtime costs, you'd be looking for functions with high self
  times.  This is not a hard rule, though -- it might be that the
  function calls some other measured function which you can't
  optimize, so instead you'd need to reduce the number of calls made
  to it...

* Overall, `map' was the direct caller for this function for 16% of
  its runtime, and `loop' was the caller for the other 84%.

* Obviously, `loop' is not a good enough name, but the "[126]" on its
  right is an identifying index into the results -- if you search for
  it, you'll see its entry with its source location (if it has one).
  Note that this function (`parse-def-values') has a "[2]" label that
  identifies it for such occurrences elsewhere.  This is even more
  obviously useful for functions with no name.

* Overall, for 80.7% of this function's runtime the callee it had
  invoked was `parse-lam' -- so if this function is a bottleneck, you
  should look there for what to optimize.
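The percentages in an entry like this can be reconstructed
mechanically from the stack samples.  Here's a minimal sketch in plain
Python (not the profiler's actual code, and the stacks below are made
up): every function on a sampled stack accrues "total" time, only the
topmost frame accrues "self" time, and the caller percentages come
from the frame directly below the function:

```python
from collections import Counter

# Hypothetical samples: each is a call stack, innermost (running) last.
samples = [
    ("main", "loop", "parse-def-values", "parse-lam"),
    ("main", "loop", "parse-def-values", "parse-lam"),
    ("main", "loop", "parse-def-values", "parse-lam"),
    ("main", "map", "parse-def-values", "parse-case-lam"),
    ("main", "loop"),
]

total = Counter()    # samples with the function anywhere on the stack
self_ = Counter()    # samples with it on top (actually running)
callers = Counter()  # direct caller of parse-def-values per sample

for stack in samples:
    for fn in set(stack):  # each function counts once per sample
        total[fn] += 1
    self_[stack[-1]] += 1
    if "parse-def-values" in stack:
        # (assumes it never appears at the very bottom of a stack)
        i = stack.index("parse-def-values")
        callers[stack[i - 1]] += 1

print(total["parse-def-values"], self_["parse-def-values"])  # 4 0
share = {c: k / total["parse-def-values"] for c, k in callers.items()}
print(share["loop"], share["map"])  # 0.75 0.25
```

A function like `parse-def-values' above can have a large total and a
zero self count exactly because it's always mid-stack, never on top.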


>                                  compile-splice [3]     24.3%
>                                  compile [141]          75.7%
>   [3]   598(2.3%)     48(0.2%) compile-splice ...song/compiler/compiler.rkt:570:0
>                                  compile-def-values [9] 75.0%
>                                  compile-splice [3]      24.3%

This is an example of a function that did significant work in its own
body.  The callers part doesn't sum to 100% because the function
called itself a number of times, and the same goes for the callees
part.



> I'll cut off the output here, because there's pages and pages of
> this.  I don't know what I should be looking at.  Are the nodes
> being printed in any particular order?

I think that they're printed in topological order, but I'm not sure.
There are some obvious tradeoffs here: if it's sorted by the self
runtime (or any other measure), you get results that make it easy to
find the function that spent the most self runtime, and the numbers
are easy to glance over to spot "interesting" areas.  But it's very
rarely that you'd want to do something in the function itself; more
frequently you'd want to deal with its callers and callees, and with
such a sort you'd end up jumping up and down the page to figure out
the flow.  Doing a topological sort makes that kind of reading
easier, and I think that it's the right choice, even though it gives
up the easy at-a-glance scan that a by-time sort provides.
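("Topological order" here means each function is listed before the
functions it calls, cycles aside.)  As an illustration -- a sketch in
plain Python, not the profiler's code -- Kahn's algorithm produces
that ordering on a tiny hypothetical call graph:

```python
from collections import deque

# Hypothetical call graph: caller -> callees.
calls = {
    "main": ["compile", "map"],
    "compile": ["compile-splice"],
    "map": ["parse-def-values"],
    "compile-splice": [],
    "parse-def-values": [],
}

def topo_order(graph):
    """Kahn's algorithm: emit each node only after all its callers."""
    indeg = {n: 0 for n in graph}
    for callees in graph.values():
        for c in callees:
            indeg[c] += 1
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in graph[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return order

order = topo_order(calls)
print(order)  # "main" first; every caller precedes its callees
```

Reading entries in this order lets you follow the flow downward from
callers to callees without jumping around the page.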

If you have ideas on how to improve the docs, feel free to suggest
patches...


> What else can I do to make the output sensical?  The fact that the
> function names are getting cut off with the '...'s is unhelpful.

Yeah, that's a problem too -- as I said above.  There are obvious
limitations for this kind of output, so one thing that you may want to
try is the graphviz renderer.  That would be limited too, and I think
that a real improvement would come in some interactive renderer --
either some GUI thing or some HTML thing:

* A GUI thing would be great, since source locations and generally
  more descriptive information can appear in popups etc.  I
  started on one once, and my intention was to do something like the
  graphviz layout, and that turned out to be (surprise) pretty hard to
  do.  I can probably send the code to anyone who's interested in
  continuing it.  Another option is to just let graphviz do the
  layout, making it accessible only if it's present.  BTW, the layout
  is important since without it you get an indecipherable pile of
  boxes that you need to sort out to see how things go -- and you need
  to do that after every optimization iteration which makes it
  impractical.  (Also, doing some naive layout like the module browser
  won't work, since you get much longer chains, and many of them side
  by side.)

* An HTML thing could start with the graphviz HTML mode which
  generates a picture and a corresponding image map.  The rest of the
  information could be done in a similar way to the GUI thing -- with
  popups etc.  The problem with that approach is that it's sometimes
  useful to reorganize things as you sort things out, and this
  wouldn't be available here without making up a solution that is
  equivalent to the GUI one (ie, let graphviz do the layout, write
  the rest in JS).

* Finally, another option that is worth exploring is to see if there's
  some good visualization tool that works with gprof output or some
  other format (I'm sure there are), then write a renderer that
  produces that output.  IIRC, the actual gprof output is limited in
  that it hides some of the details, and I think that I saw some
  output format that is designed to provide more information in a
  machine-parsable way.  BTW, it's easy to write such renderers, and
  it's even easier to write some trivial renderer that dumps the call
  chains that it observed -- in case you know of a tool that takes
  that as its input.
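To show how little such a trivial renderer takes, here's a sketch in
plain Python (hypothetical sample data, not the actual profiler API):
dump each distinct observed call chain together with the number of
samples it showed up in, one chain per line, for some downstream tool
to ingest:

```python
from collections import Counter

# Hypothetical observed stack samples, innermost frame last.
samples = [
    ("main", "loop", "parse-def-values", "parse-lam"),
    ("main", "loop", "parse-def-values", "parse-lam"),
    ("main", "map", "parse-def-values"),
]

def render_chains(samples):
    """A trivial renderer: one line per distinct chain, with its count."""
    counts = Counter(samples)
    return "\n".join(
        f"{n} {' -> '.join(chain)}"
        for chain, n in counts.most_common()
    )

print(render_chains(samples))
# 2 main -> loop -> parse-def-values -> parse-lam
# 1 main -> map -> parse-def-values
```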

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


Posted on the users mailing list.