[racket] plot request/patch: independent control of y axis in density plots
On 02/29/2012 05:15 PM, John Clements wrote:
> Plot's new "density" function is awesome. I'd like to add something to it, though; independent control of the y axis.
>
> Here's the motivating scenario; I'm looking at server logs, to try to see which users are hammering the handin server hardest. Suppose I take a list of numbers representing the seconds on which a submission occurred. I can plot the density of these using (density …), but what I get is the relative density, rather than the absolute density. In this case, I want the y axis to have the units "elements per unit time". This is different from an application such as the one in the docs where the number of data points is irrelevant.
>
> This problem becomes much more acute when I'm trying to compare two different sets of server logs; the current behavior essentially normalizes w.r.t. the number of points.
>
> The easiest way to fix this is just to allow the user to have independent control over the y scaling, so that you can for instance write:
>
> (plot (density all-seconds 0.0625
> #:y-adjust (/ 1 (length all-seconds)))
> #:width 800)
>
> to get a graph that shows density in hits per second.
If you're only plotting the density graph, you could currently do this:

  (define scale (/ 1 (length all-seconds)))
  (parameterize ([plot-y-ticks (ticks-scale (plot-y-ticks)
                                            (linear-scale scale))])
    (plot (density all-seconds 0.0625)))

But you probably don't want to. First, some background.
A Kernel Density Estimator (KDE) like `density' constructs an estimate
of the probability distribution that generated some samples. It does
this by centering a "kernel" at every point, adding them up pointwise,
and normalizing. Conceptually, anyway; `density' uses a specialized
algorithm that is efficient even with hundreds of thousands of samples,
but only works with Gaussian kernels.
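
The conceptual version can be written down directly. This is only a minimal sketch, assuming a Gaussian kernel and a bandwidth chosen by hand; it is not how `density' is implemented, and the names `gaussian-pdf' and `naive-kde' are made up for illustration.

```racket
#lang racket
;; Naive KDE sketch: center a kernel at every sample, add them up
;; pointwise, and normalize. NOT the algorithm `density' uses.

;; Standard Gaussian probability density.
(define (gaussian-pdf z)
  (/ (exp (* -1/2 (sqr z)))
     (sqrt (* 2 pi))))

;; Returns a function from x to the estimated density at x.
;; `bandwidth' is the kernel width (an assumed, hand-picked value).
(define ((naive-kde samples bandwidth) x)
  (define n (length samples))
  (/ (for/sum ([s (in-list samples)])
       (gaussian-pdf (/ (- x s) bandwidth)))
     (* n bandwidth)))   ; dividing by n is the normalization step
```

Because of the division by n, feeding in every sample twice gives the same curve. That is the normalization w.r.t. the number of points described in the original request.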
Using `density' to smear discrete points and accumulate them is a hack
that will probably come back to haunt you sometime. You've already found
one reason. There are two others, both of which come from the fact that
KDEs are designed to converge to the correct density as the number of
samples increases.
1. The kernel width has to be a function of the number of samples, and
it has to approach zero as the number of samples increases. You've
compensated for this, sort of, by multiplying the width by 0.0625. That
won't always get the result you want.
2. The kernels are almost always symmetric, and probably not the shape
you really want.
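
To make point 1 concrete, here is a sketch of a common rule-of-thumb bandwidth, h = adjust * 1.06 * sd * n^(-1/5). Treat the formula as an assumption about what a KDE typically does rather than a statement of what `density' computes; `std-dev' and `rule-of-thumb-bandwidth' are hypothetical helper names.

```racket
#lang racket
;; Rule-of-thumb bandwidth sketch (assumed formula, see the text above).

;; Population standard deviation, computed by hand to stay self-contained.
(define (std-dev xs)
  (define n (length xs))
  (define mean (/ (apply + xs) n))
  (sqrt (/ (for/sum ([x (in-list xs)]) (sqr (- x mean)))
           n)))

;; The width shrinks as the number of samples grows, and it also scales
;; with the spread of the data. `adjust' is a constant multiplier.
(define (rule-of-thumb-bandwidth xs [adjust 1])
  (* adjust 1.06 (std-dev xs) (expt (length xs) -0.2)))
```

A constant multiplier such as 0.0625 shrinks the result by a fixed factor, but the width still depends on how many samples there are and how spread out they are, so two logs of different sizes end up smeared by different amounts.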
If you want to smear points and accumulate them in a way that properly
represents server load, you should add up your own kernels that
represent the resources it takes to process an assignment. This might
help you get started:

  ;; Unnormalized Gaussian of the given width, centered at s
  (define (((kernel width) s) x)
    (exp (* -1/2 (sqr (/ (- s x) width)))))

  (define width 0.01)
  (define kernels (map (kernel width) all-seconds))

  ;; Sum every kernel at x, plotted over the data range plus 4 widths
  (plot (function (λ (x) (apply + (map (λ (k) (k x)) kernels)))
                  (- (apply min all-seconds) (* width 4))
                  (+ (apply max all-seconds) (* width 4))))

The kernel in this case is an unnormalized Gaussian centered on the
timestamp of each log entry. Using it means assuming that the log message
is recorded in the exact middle of processing an assignment, that the
middle of processing has the highest server load, and that the load is
symmetric around that point.
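
Because that kernel is unnormalized, the height of the summed curve depends on the kernel width. A minimal sketch of one way to get a y axis in submissions per second, assuming a Gaussian kernel is acceptable: divide each kernel by width * sqrt(2 pi) so it integrates to 1, which makes the sum integrate to the number of submissions. The names `unit-kernel' and `hit-rate' are hypothetical.

```racket
#lang racket
;; Sketch: Gaussian kernel scaled so that it integrates to 1.
(define (((unit-kernel width) s) x)
  (/ (exp (* -1/2 (sqr (/ (- s x) width))))
     (* width (sqrt (* 2 pi)))))

;; Returns a function from a time x (in seconds) to the summed kernels
;; at x. Its integral over all time equals (length seconds).
(define ((hit-rate seconds width) x)
  (define kernel-at (unit-kernel width))
  (for/sum ([s (in-list seconds)])
    ((kernel-at s) x)))
```

The result can be passed to `function' in place of the λ in the earlier snippet, for example (plot (function (hit-rate all-seconds width) x-min x-max)) with the same bounds as before.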
Wow, that ended up way longer than I intended.
Neil ⊥