[plt-scheme] Statistics (V301.5 Speed Up)

From: Greg Woodhouse (gregory.woodhouse at sbcglobal.net)
Date: Fri Feb 10 15:50:54 EST 2006

--- "Will M. Farr" <farr at MIT.EDU> wrote:

> Whoops---just sent this to Greg.  Sorry for the duplicate copy, Greg.
> 
> On 10 Feb 2006, at 3:03 PM, Greg Woodhouse wrote:
> 
> > The classical Central Limit Theorem (CLT)
> > essentially says that if you have a collection {X(1), X(2), ...} of
> > *independent* random variables sharing the same probability
> > distribution, then in the limit the sum approaches a normal
> (Gaussian)
> > distribution.
> 
> This is true, but the numerical size of "in the limit" for real 
> distributions is often so large as to be useless.  

Yes, you're right, there is no guarantee here.

> Also, there are
> some 
> restrictions on the probability distribution of the X(1), X(2), ..., 
> which may be relevant in this case (since the distribution of running
> 
> times is going to be very unusual---discontinuous, even).  

Hmmm...I'll have to defer to you on that one. I do expect them to be
IID, but I'm a little unsure of what effect a consistent step
discontinuity might have. The long tails (which you mention below)
worry me more. Queuing theory is sure not my strong point (nor is
statistics, for that matter), but I'm trying to decide whether a
decreasing entrance cost (more cache hits) would be relevant here. But
never mind data: there is the instruction cache to worry about.

> Given all 
> this, the ratio of smallest running times between compiled and 
> uncompiled code may be a more robust estimator of the true "speedup" 
> from compiling in the presence of real-world noise than an average.  

Is the idea here to get the best scenario (good cache hits, no
preemption at the system level, etc.)? That certainly would make sense,
if you think th best result it the purest, in the sense of having the
least interference from outside factors not directly related to the
problem at hand.

> (I'm not saying it is, but without some supporting data on the 
> gaussianity of the distribution of the sum of, say, 1000 trials, I 
> don't think the CLT gets you anywhere in practice.)
> 
> Personally, I would be much less suspicious of the ratio of min run 
> times over 1000 trials than I would be of the ratio of averages,
> given 
> that the run time is, in principle, bounded below and unbounded
> above.
> 
> Will
> 


===
Gregory Woodhouse  <gregory.woodhouse at sbcglobal.net>
"All truth passes through three stages: First, it is ridiculed.
Second, it is violently opposed. Third, it is accepted as
being self-evident."
--Arthur Schopenhauer


Posted on the users mailing list.