[plt-scheme] PLT with very large datasets?
On Mon, Jun 30, 2008 at 06:01, Mark Engelberg <mark.engelberg at gmail.com> wrote:
> I played around a bit with the dataset when it first came out.
>
> My laptop is fairly slow -- 1 GB of RAM, a dual 1.6 GHz processor, and a
> 5400 RPM hard drive.
>
> First, I tried a couple of database programs to manage the data, but
> found that accessing data stored on the hard drive was far too slow,
> and the only way I was going to get reasonable performance was by
> storing the whole thing in memory.
I tried MySQL about a year ago, too. It was likewise too slow.
> Since my memory was limited, I
> read the book "Managing Gigabytes" to get some ideas on how to index
> and compress the data in a way that would fit in my memory footprint,
> which I was eventually able to do.
I will see if I can get hold of that book.
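For anyone else curious, one of the index-compression techniques that book covers is storing the gaps between sorted ids and variable-byte encoding them, since small gaps fit in one byte. A minimal sketch in Python (the ids here are made up, not Netflix data):

```python
# Sketch: delta + variable-byte (vbyte) compression of a sorted id list,
# one of the techniques described in "Managing Gigabytes".
def vbyte_encode(nums):
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, continuation implied
            n >>= 7
        out.append(n | 0x80)       # high bit marks the final byte
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:               # final byte of this number
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= (b & 0x7F) << shift
            shift += 7
    return nums

ids = [3, 18, 25, 130, 4000]
gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]  # deltas compress better
decoded = vbyte_decode(vbyte_encode(gaps))
restored = [sum(decoded[:i + 1]) for i in range(len(decoded))]
print(restored == ids)  # round-trips back to the original ids
```

Nothing Netflix-specific here; the point is just that sorted ids shrink a lot once you encode gaps instead of absolute values.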
>
> I started out using Python, with homogeneous vectors to keep the memory
> footprint low, but it was too slow to analyze all the data. I
> eventually rewrote everything in C++, which resulted in about a 100x
> speedup. This allowed my program to run in a couple of hours, as
> opposed to a couple of weeks. PLTScheme's performance is roughly on
> par with Python's as far as I know, so my guess is that unless you
> have a dramatically faster computer than I do, you'll find that
> PLTScheme is too slow to process the Netflix dataset.
Hmm, my PC is a bit faster (at best 2x) and has twice the memory. This
is certainly worth keeping in mind, especially since I am more
interested in exploring (= trying out lots of silly ideas) than in
implementing something specific. One problem is that I know very
little C, and I am clueless about C++. Oh well.
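To make the homogeneous-vector idea concrete in Python terms: the standard `array` module stores fixed-width values contiguously, so a (user, movie, rating) triple can come down to about 7 bytes instead of a tuple of boxed ints. The field widths below are my own guess at what fits the Netflix data (~480k users, ~17,770 movies, ratings 1-5), and the sample triples are made up:

```python
# Sketch: parallel homogeneous arrays instead of a list of tuples.
from array import array

users   = array('I')  # unsigned 32-bit: user ids
movies  = array('H')  # unsigned 16-bit: movie ids (~17,770 fits easily)
ratings = array('B')  # unsigned 8-bit: ratings 1..5

for u, m, r in [(6, 12, 3), (6, 99, 5), (7, 12, 4)]:  # stand-in data
    users.append(u)
    movies.append(m)
    ratings.append(r)

# 4 + 2 + 1 bytes per rating, versus ~100 bytes for a tuple of Python ints
print(users.itemsize + movies.itemsize + ratings.itemsize)
```

Of course this only fixes the memory footprint, not the interpreter overhead Mark ran into when actually crunching the numbers.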
>
> It was fun to work on the Netflix contest, and I wish you the best of luck!
>
> --Mark
Thanks :)
--Yavuz