[racket] Using the profiler effectively?
Just now, Danny Yoo wrote:
>
> But I don't know how to interpret the results I'm getting back from
> the profiler yet. I guess I'm trying to say: I'm illiterate: I
> don't know how to read the textual output of the profiler, and need
> some help.
The output is a conventional one. (It's not completely compatible
with gprof -- I based it on the output of Allegro Lisp -- but it's
close enough that it wouldn't be hard to adapt it.)
> Total cpu time observed: 26160ms (out of 26538ms)
> Number of samples taken: 418 (once every 63ms)
> (Hiding functions with self<1.0% and local<2.0%: 16 of 369 hidden)
This part should be obvious. One potential problem is that you have
a very small number of samples -- it would help to just run your code
multiple times, or to sample more frequently, so you get more data.
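For concreteness, here is a minimal sketch of turning those two knobs
with `profile-thunk' (the `work' function is a made-up stand-in for
your code; `#:delay' and `#:repeat' are the keywords documented for
the `profile' collection, so check your version's docs):

```racket
#lang racket
(require profile)

;; Hypothetical workload, just to have something to sample.
(define (work)
  (for/sum ([i (in-range 1000000)])
    (* i i)))

;; Sample every 1ms instead of the default 50ms, and run the
;; thunk 10 times, so many more samples get collected.
(profile-thunk work #:delay 0.001 #:repeat 10)
```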
> [...]
It prints long lines, which your emailer butchered (and I tried to fix
below).
> map [120] 16.0%
> loop [126] 84.0%
> [2] 1576(6.0%) 0(0.0%) parse-def-values ...parse-bytecode-5.1.2.rkt:273:3
> parse-lam [8] 80.7%
> parse-case-lam [10] 9.6%
> parse-let-void [14] 6.3%
> parse-with-cont-mark [154] 3.3%
So, this entry gives you information about `parse-def-values', which
is defined in that file, at 273:3. (The "...s" are needed because
without them the lines would turn from inconveniently long to
problematically long; more below.) Here are some random bits that
should explain the rest:
* It was running for 1576ms, which is 6.0% of the overall runtime.
(Remember that "it was running" is an estimate, which is actually "I
observed an active stack frame for this call N times which sum up to
that time").
* It spent 0ms in its own code, which is 0.0% of its running time. In
this case, 0 means that this function was never at the top of the
stack, which means that it's very unlikely to be a bottleneck. To
cut runtime costs, you'd be looking for functions with high such
self times.  But this is not a hard rule -- it might be that the
function called some other measured function that you can't optimize,
so you'd need to reduce the number of calls you make to it instead...
* Overall, `map' was the direct caller for this function for 16% of
its runtime, and `loop' was the caller for the other 84%.
* Obviously, `loop' is not a good enough name, but the "[126]" on its
right is an identifying index into the results -- if you search for
it, you'll see its entry with its source location (if it has one).
  Note that this function (`parse-def-values') has a "[2]" label that
  identifies it in such occurrences elsewhere.  This is even more
  obviously useful for functions with no name.
* Overall, in 80.7% of this function's runtime the callee it invoked
  was `parse-lam' -- so if this function is a bottleneck, that's
  where you should look for things to optimize.
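(As an aside, the textual renderer's column widths and hiding
thresholds can be adjusted.  A sketch, assuming the keywords from the
`profile/render-text' docs -- `work' is again a made-up workload, and
the lambda ignores any extra renderer arguments since the exact
renderer protocol may differ between versions:)

```racket
#lang racket
(require profile profile/render-text)

;; Hypothetical workload.
(define (work)
  (for/sum ([i (in-range 1000000)]) (* i i)))

;; Widen the source-location column so fewer entries get cut
;; down to "...", and don't hide any small functions.
(profile-thunk work
               #:render (λ (pf . _)
                          (render pf
                                  #:truncate-source 120
                                  #:hide-self 0
                                  #:hide-subs 0)))
```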
> compile-splice [3] 24.3%
> compile [141] 75.7%
> [3] 598(2.3%) 48(0.2%) compile-splice ...song/compiler/compiler.rkt:570:0
> compile-def-values [9] 75.0%
> compile-splice [3] 24.3%
This is an example of a function that did significant work in its own
body.  The callers part doesn't sum up to 100% since it called itself
a bunch of times, and the same goes for the callees part.
> I'll cut off the output here, because there's pages and pages of
> this. I don't know what I should be looking at. Are the nodes
> being printed in any particular order?
I think that they're printed in a topological order, but I'm not
sure.  There are some obvious tradeoffs here: if the output is sorted
by self runtime (or any other measure), it's easy to find the
function that spent the most self time.  But it's very rare that
you'd want to do something in that function itself; more frequently
you'd deal with its callers and callees, and with such a sort you'd
end up jumping up and down the page to figure out the flow.  A
topological sort makes that kind of thing easier, and I think that
it's the right choice -- even though with a numeric sort it's easier
to glance at the numbers and spot "interesting" areas.
If you have ideas on how to improve the docs, feel free to suggest
patches...
> What else can I do to make the output sensical? The fact that the
> function names are getting cut off with the '...'s is unhelpful.
Yeah, that's a problem too -- as I said above.  There are obvious
limitations to this kind of output, so one thing that you may want to
try is the graphviz renderer.  That would be limited too, and I think
that a real improvement would come from some interactive renderer --
either some GUI thing or some HTML thing:
* A GUI thing would be great, since source locations and generally
  more descriptive information can appear in popups etc.  I
started on one once, and my intention was to do something like the
graphviz layout, and that turned out to be (surprise) pretty hard to
do. I can probably send the code to anyone who's interested in
continuing it. Another option is to just let graphviz do the
layout, making it accessible only if it's present. BTW, the layout
is important since without it you get an indecipherable pile of
boxes that you need to sort out to see how things go -- and you need
to do that after every optimization iteration which makes it
  impractical.  (Also, doing some naive layout like the module
  browser's won't work, since you get much longer chains, and many of
  them side
by side.)
* An HTML thing could start with the graphviz HTML mode which
generates a picture and a corresponding image map. The rest of the
information could be done in a similar way to the GUI thing -- with
  popups etc.  The problem with that approach is that it's sometimes
  useful to reorganize things as you sort them out, and that
  wouldn't be available here without making up a solution that is
  equivalent to the GUI one (i.e., let graphviz do the layout, and
  write the rest in JS).
* Finally, another option that is worth exploring is to see if
  there's some good visualization tool that works with gprof output
  or some other format (I'm sure there are), then write a renderer
  that produces that format.  IIRC, the actual gprof output is
  limited in
that it hides some of the details, and I think that I saw some
output format that is designed to provide more information in a
machine parsable way. BTW, it's easy to write such renderers, and
it's even easier to write some trivial renderer that dumps the call
chains that it observed -- in case you know of a tool that takes
that as its input.
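For reference, selecting the existing graphviz renderer is just a
matter of handing it to `profile-thunk' -- a sketch, with a made-up
`work' function, where the output is dot text that you'd then feed to
graphviz (e.g. `dot -Tpng'):

```racket
#lang racket
(require profile
         (prefix-in dot: profile/render-graphviz))

;; Hypothetical workload.
(define (work)
  (for/sum ([i (in-range 1000000)]) (* i i)))

;; Print the profile as a graphviz "dot" graph on stdout; save
;; it to a file and run it through `dot' to get a picture.
(profile-thunk work #:render (λ (pf . _) (dot:render pf)))
```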
--
((lambda (x) (x x)) (lambda (x) (x x))) Eli Barzilay:
http://barzilay.org/ Maze is Life!