[plt-scheme] PLT with very large datasets?

From: Matthias Felleisen (matthias at ccs.neu.edu)
Date: Sun Jun 29 18:31:09 EDT 2008


It sounds like you have a pretty good idea about your performance
requirements. If I were you, I would conduct some basic stress tests
for data representations and algorithms that come close to what
the contest demands. Make up random data, write to a database,
and then see what happens. It can't take more than a couple of
hours to get an idea about the performance of some alternatives
(lists, vectors, heterogeneous vectors, etc.) -- Matthias




On Jun 29, 2008, at 5:47 PM, Yavuz Arkun wrote:

> Hello,
> I am thinking about playing around with the dataset for the Netflix
> competition (http://www.netflixprize.com/) using PLT Scheme, and
> before doing that, I thought I might consult the wise men of the list
> to see if it is even feasible.
>
> The bulk of the dataset is about 100 million triplets: User ID, Movie
> ID, rating. There are about 20k unique Movie IDs and 500k User IDs,
> and the ratings are integers 1-5, but I think the ratings will need
> to be floats for processing. So, in theory, about 9 bytes per triplet,
> which comes to 900MB for one copy of the dataset.
>
> I am not sure what kind of space behavior relevant algorithms have,
> but for argument's sake, let's assume that I might have to work with 2
> copies of the dataset at any given time.
>
> So the question is: what are the memory limitations of PLT Scheme?
> Could I read the whole dataset in, assuming I have enough RAM? Does
> the OS play a role? (I can use XP, OS X or Linux, but prefer OS X.)
>
> If I cannot work in RAM, is there a way to memory map the data
> efficiently, using standard facilities of PLT Scheme?
>
> What would be the most efficient data type to use: an immutable vector
> of immutable 3-number vectors? Or are there more specialized types
> that I could use?
>
> In short, is working with this dataset feasible in PLT Scheme?
>
> I am sorry for the vagueness of my request and thanks in advance for
> any hints you might give.
>
> --Yavuz

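The storage estimate in the quoted message (about 9 bytes per triplet, 900MB per copy) can be checked with a small sketch. The message does not say how the 9 bytes are divided, so the field widths below are an assumption: 3 bytes for the user ID, 2 bytes for the movie ID, and a 4-byte single-precision float for the rating. The function names are hypothetical, and Python is used only to make the arithmetic concrete; it says nothing about how PLT Scheme itself would lay out the data.

```python
import struct

# Assumed field widths (not stated in the message); one way to reach 9 bytes.
USER_ID_BYTES = 3   # ~500k distinct users fit in 24 bits (2**24 = 16,777,216)
MOVIE_ID_BYTES = 2  # ~20k distinct movies fit in 16 bits (2**16 = 65,536)
RATING_BYTES = 4    # single-precision float

RECORD_BYTES = USER_ID_BYTES + MOVIE_ID_BYTES + RATING_BYTES


def pack_triplet(user_id, movie_id, rating):
    """Pack one (user, movie, rating) triplet into RECORD_BYTES bytes."""
    return (
        user_id.to_bytes(USER_ID_BYTES, "little")
        + movie_id.to_bytes(MOVIE_ID_BYTES, "little")
        + struct.pack("<f", rating)
    )


def unpack_triplet(record):
    """Inverse of pack_triplet: recover (user_id, movie_id, rating)."""
    user_end = USER_ID_BYTES
    movie_end = user_end + MOVIE_ID_BYTES
    user_id = int.from_bytes(record[:user_end], "little")
    movie_id = int.from_bytes(record[user_end:movie_end], "little")
    (rating,) = struct.unpack("<f", record[movie_end:movie_end + RATING_BYTES])
    return user_id, movie_id, rating


def dataset_bytes(n_triplets, copies=1):
    """Raw payload size in bytes, ignoring any per-object overhead."""
    return n_triplets * RECORD_BYTES * copies


if __name__ == "__main__":
    record = pack_triplet(499_999, 19_999, 4.0)
    print(len(record), unpack_triplet(record))
    print(dataset_bytes(100_000_000) / 1e6, "MB for one copy")
    print(dataset_bytes(100_000_000, copies=2) / 1e6, "MB for two copies")
```

These figures count raw payload only. A representation built from boxed numbers or nested vectors would add per-object overhead on top, which is the kind of difference the stress tests suggested in the reply would expose.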


Posted on the users mailing list.