[plt-scheme] PLT with very large datasets?

From: Yavuz Arkun (yarkun at gmail.com)
Date: Mon Jun 30 02:04:35 EDT 2008

On Mon, Jun 30, 2008 at 06:01, Mark Engelberg <mark.engelberg at gmail.com> wrote:
> I played around a bit with the dataset when it first came out.
>
> My laptop is fairly slow -- 1 GB of RAM, a dual 1.6 GHz processor, and a
> 5400 RPM hard drive.
>
> First, I tried a couple of database programs to manage the data, but
> found that accessing data stored on the hard drive was far too slow,
> and the only way I was going to get reasonable performance was by
> storing the whole thing in memory.

I, too, tried MySQL about a year ago. It was too slow.

> Since my memory was limited, I
> read the book "Managing Gigabytes" to get some ideas on how to index
> and compress the data in a way that would fit in my memory footprint,
> which I was eventually able to do.

I will see if I can get hold of that book.
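From what I understand, one core trick in that book is to delta-encode
sorted integer IDs and then store the gaps in variable-byte form. A
rough, untested sketch of that idea in PLT Scheme -- all names here are
my own invention, not from the book:

  ;; '(3 70000 70012) -> '(3 69997 12): gaps between sorted IDs
  ;; are small, so they compress much better than the raw IDs
  (define (gaps sorted-ids)
    (let loop ((prev 0) (ids sorted-ids) (acc '()))
      (if (null? ids)
          (reverse acc)
          (loop (car ids) (cdr ids) (cons (- (car ids) prev) acc)))))

  ;; one non-negative integer -> list of bytes, most significant
  ;; 7-bit group first; the high bit is set on all but the last byte
  (define (vbyte-encode n)
    (let loop ((n (quotient n 128)) (acc (list (remainder n 128))))
      (if (zero? n)
          acc
          (loop (quotient n 128)
                (cons (bitwise-ior 128 (remainder n 128)) acc)))))

  ;; read one integer back from the front of a byte list;
  ;; returns the value and the remaining bytes
  (define (vbyte-decode bs)
    (let loop ((bs bs) (n 0))
      (let ((b (car bs)))
        (if (>= b 128)
            (loop (cdr bs) (+ (* n 128) (- b 128)))
            (values (+ (* n 128) b) (cdr bs))))))

  ;; (map vbyte-encode (gaps '(3 70000 70012)))
  ;;   => ((3) (132 162 109) (12))
  ;; the byte lists can then be flattened into one byte
  ;; string with list->bytes for compact storage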

>
> I started out using Python, with homogeneous vectors to keep the memory
> footprint low, but it was too slow to analyze all the data.  I
> eventually rewrote everything in C++, which resulted in about a 100x
> speedup.  This allowed my program to run in a couple of hours, as
> opposed to a couple of weeks.  PLT Scheme's performance is roughly on
> par with Python's as far as I know, so my guess is that unless you
> have a dramatically faster computer than I do, you'll find that
> PLT Scheme is too slow to process the Netflix dataset.

Hmm, my PC is a bit faster (at best 2x) and has twice the memory. This
is certainly worth keeping in mind, especially since I am more
interested in exploring (= trying out lots of silly ideas) than in
implementing something specific. One problem is that I know very
little C, and I am clueless about C++. Oh well.
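That said, even staying in PLT Scheme I could probably mimic Python's
homogeneous vectors with byte strings: every Netflix rating is 1-5, so
one byte per rating is enough. A rough sketch -- the parallel-table
layout and all the names are my own illustration; only the rating count
is the published figure:

  ;; the published Netflix training set has 100,480,507 ratings
  (define num-ratings 100480507)

  ;; one byte per rating (~100 MB); 0 marks "not loaded yet"
  (define ratings (make-bytes num-ratings 0))
  (define (rating-ref i) (bytes-ref ratings i))
  (define (rating-set! i r) (bytes-set! ratings i r))

  ;; parallel table of user ids, 4 bytes each (~400 MB),
  ;; stored little-endian inside one byte string
  (define user-ids (make-bytes (* 4 num-ratings) 0))
  (define (user-ref i)
    (integer-bytes->integer user-ids #f #f (* 4 i) (* 4 (+ i 1))))
  (define (user-set! i u)
    (integer->integer-bytes u 4 #f #f user-ids (* 4 i)))

A list or vector of boxed numbers would cost several machine words per
entry, so this kind of packing seems necessary to stay anywhere near
my 2 GB.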

>
> It was fun to work on the Netflix contest, and I wish you the best of luck!
>
> --Mark

Thanks :)
--Yavuz

