[plt-scheme] PLT with very large datasets?

From: Yavuz Arkun (yarkun at gmail.com)
Date: Sun Jun 29 17:47:38 EDT 2008

Hello,
I am thinking about playing around with the dataset for the Netflix
competition (http://www.netflixprize.com/) using PLT Scheme, and
before doing that, I thought I might consult the wise men of the list
to see if it is even feasible.

The bulk of the dataset is about 100 million triplets: User ID, Movie
ID, rating. There are about 20k unique Movie IDs and 500k User IDs,
and the ratings are integers 1-5, but I think the ratings will need
to be floats for processing. So, in theory, about 9 bytes per triplet,
which comes to roughly 900 MB for one copy of the dataset.
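
For instance, assuming one plausible breakdown (3 bytes for a user ID,
2 for a movie ID, and 4 for a single-precision rating), the estimate
works out to:

  ;; Back-of-envelope estimate; the 3+2+4 byte breakdown is my own guess.
  (define triplets 100000000)
  (define bytes-per-triplet (+ 3 2 4))   ; user ID + movie ID + float rating
  (define total-bytes (* triplets bytes-per-triplet))
  (printf "~a MB for one copy~n" (quotient total-bytes 1000000))
  ;; => 900 MB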

I am not sure what kind of space behavior the relevant algorithms
have, but for argument's sake, let's assume that I might have to work
with two copies of the dataset at any given time.

So the question is: what are the memory limitations of PLT Scheme?
Could I read the whole dataset in, assuming I have enough RAM? Does
the OS play a role? (I can use XP, OS X or Linux, but prefer OS X.)
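
If it helps, my plan was to answer this empirically with a probe along
these lines: allocate a slice of dummy triplets, watch the heap with
current-memory-use, and extrapolate to 100 million.

  ;; Rough probe: allocate n dummy triplets and report the heap growth.
  (define (probe n)
    (collect-garbage)
    (let ([before (current-memory-use)]
          [data (make-vector n #f)])
      (let loop ([i 0])
        (when (< i n)
          (vector-set! data i (vector i i 1.0))
          (loop (+ i 1))))
      (collect-garbage)
      (printf "~a triplets use about ~a MB~n"
              n (quotient (- (current-memory-use) before) 1000000))
      data))

  ;; (probe 1000000) ; then scale up by 100 for the full dataset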

If I cannot work in RAM, is there a way to memory-map the data
efficiently using the standard facilities of PLT Scheme?
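
My fallback idea, if mapping is not an option, is to stream over the
files a line at a time so that only an accumulator stays in memory;
roughly like this (the per-line parsing is still a placeholder, and
the file name is made up, since I have not settled on the file layout
yet):

  ;; Fold over a ratings file line by line; only the accumulator stays live.
  (define (fold-ratings-file path proc init)
    (call-with-input-file path
      (lambda (in)
        (let loop ([acc init])
          (let ([line (read-line in)])
            (if (eof-object? line)
                acc
                (loop (proc line acc))))))))

  ;; e.g. count lines without loading the file (hypothetical file name):
  ;; (fold-ratings-file "ratings.txt" (lambda (line n) (+ n 1)) 0)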

What would be the most efficient data type to use: an immutable vector
of immutable 3-element vectors, one per triplet? Or are there more
specialized types that I could use?
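
What I have in mind as an alternative is a structure-of-arrays layout:
three parallel columns instead of 100 million small vectors, with each
rating kept as a byte and converted to a float only on demand. A rough
sketch (if PLT's SRFI-4 homogeneous vectors are available, they might
pack the ID columns tighter still):

  ;; Parallel columns: one header per column instead of one per triplet.
  ;; Each plain vector slot is still a machine word, so this is looser
  ;; than the 9-byte estimate above, but far leaner than nested vectors.
  (define n-ratings 100000000)                  ; the rough count from above
  (define users   (make-vector n-ratings 0))    ; user IDs as fixnums
  (define movies  (make-vector n-ratings 0))    ; movie IDs as fixnums
  (define ratings (make-bytes  n-ratings 0))    ; 1-5 fits in one byte

  (define (rating-ref i)                        ; float view, built on demand
    (exact->inexact (bytes-ref ratings i)))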

In short, is working with this dataset feasible in PLT Scheme?

I am sorry for the vagueness of my request and thanks in advance for
any hints you might give.

--Yavuz

