[plt-scheme] PLT with very large datasets?

Sun Jun 29 23:01:52 EDT 2008

I played around a bit with the dataset when it first came out.

My laptop is fairly slow -- 1 GB of RAM, a dual 1.6 GHz processor, and a
5400 rpm hard drive.

First, I tried a couple of database programs to manage the data, but
found that accessing data stored on the hard drive was far too slow,
and the only way I was going to get reasonable performance was by
storing the whole thing in memory.  Since my memory was limited, I
read the book "Managing Gigabytes" to get some ideas on how to index
and compress the data in a way that would fit in my memory footprint,
which I was eventually able to do.
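
To give a feel for the kind of arithmetic involved, here is a rough
estimate in Python. It is only a sketch: the counts are the approximate
figures from the original question, the 3 + 2 + 4 breakdown is one
possible reading of "9 bytes per triplet", and the grouped-by-movie
layout is an illustrative assumption, not the scheme described above.

```python
# Rough footprint estimates; the counts are the approximate figures
# quoted in this thread, not exact dataset statistics.
N_RATINGS = 100_000_000
N_MOVIES = 20_000

def footprint_mb(bytes_per_rating, n_ratings=N_RATINGS, index_bytes=0):
    """Return the estimated size in megabytes (1 MB = 10**6 bytes)."""
    return (bytes_per_rating * n_ratings + index_bytes) / 10**6

# Flat triplets. One possible breakdown of 9 bytes (an assumption):
# 3-byte user ID + 2-byte movie ID + 4-byte float rating.
flat = footprint_mb(3 + 2 + 4)

# Grouped by movie (hypothetical layout): 4-byte user ID + 1-byte
# integer rating per entry, plus one 4-byte offset per movie marking
# where that movie's ratings start.
grouped = footprint_mb(4 + 1, index_bytes=4 * (N_MOVIES + 1))

print(f"flat: {flat:.0f} MB, grouped by movie: {grouped:.0f} MB")
```

Dropping the movie ID from every entry and keeping the rating as a
single byte is what brings the estimate well under 1 GB; the offset
table itself is negligible at this scale.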

I started out using Python, with homogeneous vectors to keep the memory
footprint low, but it was too slow to analyze all the data.  I
eventually rewrote everything in C++, which resulted in about a 100x
speedup.  This allowed my program to run in a couple of hours, as
opposed to a couple of weeks.  PLT Scheme's performance is roughly on
par with Python's as far as I know, so my guess is that unless you
have a dramatically faster computer than I do, you'll find that
PLT Scheme is too slow to process the Netflix dataset.
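
For readers unfamiliar with homogeneous vectors in Python, the standard
`array` module is one way to get them. The sketch below is an
illustration under stated assumptions, not the code described above:
`build_index` and `ratings_for` are hypothetical helpers, movie IDs are
assumed to be renumbered from 0, and the sample triplets are made up.

```python
from array import array

def build_index(triplets, n_movies):
    """Pack (user_id, movie_id, rating) triplets into typed arrays.

    movie_id is assumed to run from 0 to n_movies - 1.
    """
    # Count ratings per movie, then turn the counts into start offsets.
    offsets = array("I", [0] * (n_movies + 1))
    for _, movie, _ in triplets:
        offsets[movie + 1] += 1
    for m in range(n_movies):
        offsets[m + 1] += offsets[m]

    users = array("I", [0] * len(triplets))    # unsigned int, typically 4 bytes
    ratings = array("B", [0] * len(triplets))  # 1 byte per rating (1-5)
    cursor = offsets[:n_movies]                # copy: next free slot per movie
    for user, movie, rating in triplets:
        pos = cursor[movie]
        users[pos] = user
        ratings[pos] = rating
        cursor[movie] += 1
    return offsets, users, ratings

def ratings_for(movie, offsets, users, ratings):
    """Return the (user_id, rating) pairs stored for one movie."""
    lo, hi = offsets[movie], offsets[movie + 1]
    return list(zip(users[lo:hi], ratings[lo:hi]))

# Made-up sample data: (user_id, movie_id, rating).
sample = [(7, 1, 5), (3, 0, 4), (9, 1, 2), (3, 2, 1)]
offsets, users, ratings = build_index(sample, n_movies=3)
print(ratings_for(1, offsets, users, ratings))
```

The arrays keep memory use close to the raw byte counts, but every
element access still goes through the interpreter, which is consistent
with the point above that the footprint can be fine while the analysis
loop remains slow.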

It was fun to work on the Netflix contest, and I wish you the best of luck!

--Mark

On Sun, Jun 29, 2008 at 2:47 PM, Yavuz Arkun <yarkun at gmail.com> wrote:
> Hello,
> I am thinking about playing around with the dataset for the Netflix
> competition (http://www.netflixprize.com/) using PLT Scheme, and
> before doing that, I thought I might consult the wise men of the list
> to see if it is even feasible.
>
> The bulk of the dataset is about 100 million triplets: User ID, Movie
> ID, rating. There are about 20k unique Movie IDs and 500k User IDs,
> and the ratings are integers 1-5, but I think the ratings will need
> to be floats for processing. So, in theory, about 9 bytes per triplet,
> which comes to 900 MB for one copy of the dataset.
>
> I am not sure what kind of space behavior relevant algorithms have,
> but for argument's sake, let's assume that I might have to work with 2
> copies of the dataset at any given time.
>
> So the question is: what are the memory limitations of PLT Scheme?
> Could I read the whole dataset in, assuming I have enough RAM? Does
> the OS play a role? (I can use XP, OS X or Linux, but prefer OS X.)
>
> If I cannot work in RAM, is there a way to memory map the data
> efficiently, using standard facilities of PLT Scheme?
>
> What would be the most efficient data type to use: immutable vectors
> of 3 numbers stored in an immutable vector? Or are there more
> specialized types that I could use?
>
> In short, is working with this dataset feasible in PLT Scheme?
>
> I am sorry for the vagueness of my request and thanks in advance for
> any hints you might give.
>
> --Yavuz