[plt-scheme] PLT with very large datasets?
Hello,
I am thinking about playing around with the dataset for the Netflix
competition (http://www.netflixprize.com/) using PLT Scheme, and
before doing that, I thought I might consult the wise men of the list
to see if it is even feasible.
The bulk of the dataset is about 100 million triplets: User ID, Movie
ID, rating. There are about 20k unique Movie IDs and 500k unique User
IDs, and the ratings are integers 1-5, but I think the ratings will
need to be floats for processing. So, in theory, about 9 bytes per
triplet, which comes to 900MB for one copy of the dataset.
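For reference, here is the back-of-envelope arithmetic behind that estimate, sketched in Python (the per-field widths are my own assumptions, not anything the dataset prescribes):

```python
# Back-of-envelope size estimate for the Netflix triplets.
# Assumed field widths: user IDs fit in 3 bytes (~500k distinct IDs
# < 2^24), movie IDs in 2 bytes (~20k < 2^16), ratings as 4-byte floats.
TRIPLETS = 100_000_000
USER_BYTES = 3
MOVIE_BYTES = 2
RATING_BYTES = 4

per_triplet = USER_BYTES + MOVIE_BYTES + RATING_BYTES
total_mb = TRIPLETS * per_triplet / 1_000_000

print(per_triplet, total_mb)  # 9 900.0
```

Of course, a naive in-memory representation (boxed numbers, per-object headers) could easily multiply that figure several times over.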
I am not sure what kind of space behavior the relevant algorithms
have, but for argument's sake, let's assume that I might have to work
with 2 copies of the dataset at any given time.
So the question is: what are the memory limitations of PLT Scheme?
Could I read the whole dataset in, assuming I have enough RAM? Does
the OS play a role? (I can use XP, OS X or Linux, but prefer OS X.)
If I cannot work in RAM, is there a way to memory map the data
efficiently, using standard facilities of PLT Scheme?
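To make the memory-mapping idea concrete, this is the access pattern I have in mind, sketched in Python's mmap/struct modules since I don't know what PLT Scheme offers here (the 10-byte fixed-width record layout and the file name are my own assumptions):

```python
import mmap
import struct

# Assumed packed record: little-endian u32 user ID, u16 movie ID,
# f32 rating -- 10 bytes per triplet, no padding.
RECORD = struct.Struct('<IHf')

# Write a tiny packed file of hypothetical triplets.
with open('triplets.bin', 'wb') as f:
    for rec in [(42, 7, 4.0), (99, 7, 2.0)]:
        f.write(RECORD.pack(*rec))

# Map the file and read record 1 by offset, without
# loading the whole file into memory.
with open('triplets.bin', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    user, movie, rating = RECORD.unpack_from(mm, 1 * RECORD.size)
    mm.close()

print(user, movie, rating)  # 99 7 2.0
```

The point is that the OS pages triplets in on demand, so the working set rather than the full 900MB has to fit in RAM.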
What would be the most efficient data type to use: immutable vectors
of 3 numbers inside one big immutable vector? Or are there more
specialized types that I could use?
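For comparison, the alternative layout I am imagining is columnar: three parallel homogeneous arrays instead of 100 million small 3-element vectors, which avoids per-object overhead. A sketch in Python's array module (the sample triplets are made up; the type widths are my assumptions):

```python
from array import array

# Three parallel homogeneous arrays, one per column.
users = array('I')    # unsigned 32-bit user IDs
movies = array('H')   # unsigned 16-bit movie IDs
ratings = array('f')  # 32-bit float ratings

# Hypothetical sample triplets: (user-id, movie-id, rating).
for u, m, r in [(1001, 1, 3.0), (2002, 1, 5.0)]:
    users.append(u)
    movies.append(m)
    ratings.append(r)

# Bytes per triplet in this layout (on typical platforms).
print(users.itemsize + movies.itemsize + ratings.itemsize)  # 10
```

If PLT Scheme has an equivalent of SRFI-4-style homogeneous numeric vectors, the same layout should be expressible there.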
In short, is working with this dataset feasible in PLT Scheme?
I am sorry for the vagueness of my request and thanks in advance for
any hints you might give.
--Yavuz