[plt-scheme] Statistics for Sequences

From: Doug Williams (m.douglas.williams at gmail.com)
Date: Thu Sep 10 10:30:12 EDT 2009

It's interesting that if I use (in-vector ...) in the for/fold statements,
the times for the for/fold version are about the same as for the (uglier) do
version (with vector-refs). [This one probably would benefit from Matthew's
performance improvements.] Actually using in-vector everywhere would mean
giving up the flexibility that motivated moving to sequences in the first
place, but it does suggest there is some hope of eventually getting the same
performance for the sequence versions (at least for vectors).
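
For readers who want to see the two styles side by side, here is a minimal
sketch. It is not the science collection's code: the names variance/for-fold
and variance/do are hypothetical, the two-pass sample variance is just one
simple way to compute it, and #lang racket/base is assumed (under PLT Scheme
4.2 the equivalent would be scheme/base).

```racket
#lang racket/base

;; Hypothetical sketch, not the science collection's implementation.
;; Both functions compute the sample variance (n - 1 denominator) in two
;; passes and assume a vector of at least two real numbers.

;; Style 1: for/fold with an explicit in-vector clause.
(define (variance/for-fold v)
  (define n (vector-length v))
  (define mean
    (/ (for/fold ([sum 0.0]) ([x (in-vector v)])
         (+ sum x))
       n))
  (/ (for/fold ([ss 0.0]) ([x (in-vector v)])
       (+ ss (* (- x mean) (- x mean))))
     (- n 1)))

;; Style 2: a do loop that indexes the vector with vector-ref.
(define (variance/do v)
  (define n (vector-length v))
  (define mean
    (do ([i 0 (+ i 1)]
         [sum 0.0 (+ sum (vector-ref v i))])
        ((= i n) (/ sum n))))
  (do ([i 0 (+ i 1)]
       [ss 0.0 (+ ss (let ([d (- (vector-ref v i) mean)]) (* d d)))])
      ((= i n) (/ ss (- n 1)))))
```

The for/fold version reads more directly, and dropping the in-vector wrapper
would let it accept any sequence; the do version only works on vectors.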

using in-vector in the for
cpu time: 266 real time: 265 gc time: 0
cpu time: 250 real time: 250 gc time: 47

current science collection routines
cpu time: 250 real time: 249 gc time: 0
cpu time: 218 real time: 218 gc time: 16

It would be nice if (for ((x some-vector)) ...) and (for ((x (in-vector
some-vector))) ...) had similar performance. I realize that at expansion
time the latter knows to expect a vector and can generate code accordingly,
while the former does not. But I can dream.
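
A rough way to compare the two forms on one's own machine is sketched below.
The helper names sum/generic and sum/in-vector are hypothetical, the vector
size and repeat count simply mirror the 100000 elements and 10 iterations
used in this thread, and #lang racket/base is assumed. No timings are shown
because they depend on the machine and the build.

```racket
#lang racket/base

;; Hypothetical helpers for comparing the two for forms.

;; Generic sequence clause: the vector is dispatched on at run time.
(define (sum/generic v)
  (for/fold ([sum 0.0]) ([x v])
    (+ sum x)))

;; Explicit in-vector clause: the expander knows to expect a vector.
(define (sum/in-vector v)
  (for/fold ([sum 0.0]) ([x (in-vector v)])
    (+ sum x)))

;; 100000 flonums: 0.0, 1.0, ..., 99999.0
(define data (build-vector 100000 (lambda (i) (exact->inexact i))))

;; Time 10 passes over the data with each form.
(time (for ([i (in-range 10)]) (sum/generic data)))
(time (for ([i (in-range 10)]) (sum/in-vector data)))
```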

On Wed, Sep 9, 2009 at 9:37 AM, Doug Williams
<m.douglas.williams at gmail.com> wrote:

> Thanks for running them for me. I guess it comes down to whether the
> flexibility is worth the performance hit. I like the flexibility. In the
> past there were times I have had to convert lists to vectors just to compute
> statistics on them, which is even less efficient.  I could include the old
> ones as vector-mean, vector-variance, etc for people who need/want the
> performance.
>
> Doug
>
>
> On Wed, Sep 9, 2009 at 9:29 AM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
>
>> I don't think the latest changes will affect the performance, since
>> unsafe operations are only used for `in-vector' and (sometimes)
>> `in-range' when they appear immediately in a `for' right-hand side.
>>
>> Times on my machine:
>>
>>  New
>>  laptop% mzscheme time-statistics.ss
>>  cpu time: 576 real time: 578 gc time: 11
>>  cpu time: 450 real time: 451 gc time: 10
>>  (that's without `in-vector'; times using `in-vector' are the same)
>>
>>  Old
>>  laptop% mzscheme time-statistics.ss
>>  cpu time: 233 real time: 237 gc time: 18
>>  cpu time: 196 real time: 198 gc time: 10
>>
>> At Wed, 9 Sep 2009 09:18:55 -0600, Doug Williams wrote:
>> > I've reimplemented the statistics module from the science collection to
>> > use sequences instead of just vectors. I like the generality better - I
>> > can use any sequence (e.g., vector or list) - but there is more of a
>> > performance hit than I would have liked. I haven't timed it with the new
>> > changes that Matthew just put in. The good news is that there isn't much
>> > of a hit for using (variance data) as opposed to (variance (in-vector
>> > data)) and there isn't a huge hit for using the contract that ensures
>> > that the sequence is a sequence of real numbers.
>> >
>> > I created a 100000-element vector and timed a loop getting the variance
>> > of the elements 10 times. Note that I created an executable that runs
>> > compiled code in both cases. [Runs of the sequence code within DrScheme
>> > are about twice the times of the compiled code - I assume they run from
>> > byte code in that case. Runs of the science collection code are about the
>> > same in DrScheme - I assume they run the compiled code.]
>> >
>> > Times using sequences [primarily using 'for/fold' for sequencing and
>> > referencing]:
>> >
>> > (variance data) : cpu time: 625 real time: 625 gc time: 32
>> > (unchecked-variance data) : cpu time: 531 real time: 531 gc time: 77
>> >
>> > (variance (in-vector data)) : cpu time: 609 real time: 609 gc time: 16
>> > (unchecked-variance (in-vector data)) : cpu time: 485 real time: 484 gc time: 0
>> >
>> > Times using vectors (current science collection routines) [primarily
>> > using 'do' for sequencing with 'vector-ref' for referencing]:
>> >
>> > (variance data) : cpu time: 235 real time: 234 gc time: 16
>> > (unchecked-variance data) : cpu time: 187 real time: 188 gc time: 46
>> >
>> > All of the normal caveats about timing values apply - just because I'm
>> > timing a statistics routine doesn't mean it's statistically relevant :).
>> >
>> > I will retime them when a nightly build with Matthew's performance
>> > improvements is available (it seems that 4.2.1.7 from Saturday is the
>> > latest) - or when I build it on my machine at home. I don't have the
>> > development tools on my laptop to build from svn.
>> >
>> > I've attached the files in case anyone wants to look them over. If
>> > someone could run them against the latest svn, it would be nice. [
>> >
>> > Comments from anyone that uses these routines from the science
>> > collection would be most welcome.
>> >
>> > Doug
>>
>>
>

Posted on the users mailing list.