[plt-scheme] PLT with very large datasets?

From: Chongkai Zhu (czhu at cs.utah.edu)
Date: Sun Jun 29 17:55:00 EDT 2008

My 2 cents:

1. PLT's GC starts to work once you use half of the memory. So if your 
actual data consumes 1G, you need at least 2G of memory.

2. You said "in theory, about 9 bytes per triplet", but Scheme is a 
dynamically typed language, which means some memory is used for type tags.

Other than those two, I can see no other memory limitations of PLT Scheme.
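
To make the two points concrete, here is a rough back-of-the-envelope
sketch in Python (the arithmetic does not depend on the language). The
triplet count, the 9 bytes per triplet, and the two live copies come
from the question; the factor of 2 is the half-of-memory rule from
point 1. The "boxed" figure is purely an assumption for illustration
(a generic 3-slot vector with one header word and 4-byte words); the
real per-value overhead in PLT Scheme may well differ.

```python
# Back-of-the-envelope memory estimate using the figures from this thread.
TRIPLETS = 100_000_000   # about 100 million (user, movie, rating) triplets
PACKED_BYTES = 9         # the "in theory" size per triplet from the question
COPIES = 2               # the question assumes two copies may be live at once
GC_FACTOR = 2            # point 1: live data should fit in half of the memory

# Assumption (not from the thread): each triplet is stored as a generic
# 3-slot vector with one header word and 4-byte words, i.e. 16 bytes.
WORD_BYTES = 4
BOXED_BYTES = (3 + 1) * WORD_BYTES


def required_bytes(bytes_per_triplet, copies=COPIES, gc_factor=GC_FACTOR):
    """Memory needed so that all live copies fit within 1/gc_factor of it."""
    return TRIPLETS * bytes_per_triplet * copies * gc_factor


def gb(n):
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return n / 10**9


packed = required_bytes(PACKED_BYTES)
boxed = required_bytes(BOXED_BYTES)
print(f"packed: {gb(packed):.1f} GB, boxed: {gb(boxed):.1f} GB")
```

Under these assumptions, a tightly packed representation needs about
3.6 GB and the hypothetical boxed one about 6.4 GB, so the choice of
representation matters at least as much as the GC headroom.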

Chongkai

Yavuz Arkun wrote:
> Hello,
> I am thinking about playing around with the dataset for the Netflix
> competition (http://www.netflixprize.com/) using PLT Scheme, and
> before doing that, I thought I might consult the wise men of the list
> to see if it is even feasible.
>
> The bulk of the dataset is about 100 million triplets: User ID, Movie
> ID, rating. There are about 20k unique Movie IDs and 500k User IDs,
> and the ratings are integers 1-5, but I think the ratings will need
> to be floats for processing. So, in theory, about 9 bytes per triplet,
> which comes to 900MB for one copy of the dataset.
>
> I am not sure what kind of space behavior relevant algorithms have,
> but for argument's sake, let's assume that I might have to work with 2
> copies of the dataset at any given time.
>
> So the question is: what are the memory limitations of PLT Scheme?
> Could I read the whole dataset in, assuming I have enough RAM? Does
> the OS play a role? (I can use XP, OS X or Linux, but prefer OS X.)
>
> If I cannot work in RAM, is there a way to memory map the data
> efficiently, using standard facilities of PLT Scheme?
>
> What would be the most efficient data type to use, immutable vectors
> of 3 numbers in immutable vectors? Or are there more specialized types
> that I could use?
>
> In short, is working with this dataset feasible in PLT Scheme?
>
> I am sorry for the vagueness of my request and thanks in advance for
> any hints you might give.
>
> --Yavuz



Posted on the users mailing list.