[plt-scheme] PLT with very large datasets?

From: Mark Engelberg (mark.engelberg at gmail.com)
Date: Mon Jun 30 02:50:23 EDT 2008

For most of the algorithms you'd want to experiment with, you'll want
to iterate through all the customer/rating pairs for a given movie, or
iterate through all the movie/rating pairs for a given customer.
Also, it is useful to be able to look up the specific rating of a given
movie/customer combination.

Supporting iteration over the customer/rating pairs of a given movie is
the simplest (because movie numbers are sequential).  If I remember
correctly, customer/rating pairs can be stored in a single int by
storing the customer number in the first 29 bits, and the rating in
the last 3 bits.  So you create an "index" array that maps movie
numbers to vectors of customer/rating pairs stored as ints.
Alternatively, you can be slightly more space-efficient by creating one
enormous array of ints, where all the customer/rating pair ints for
movie 1 are stored sequentially (in sorted order), then all the ints
for movie 2, and so on, with the index array pointing to the offset
where each movie's block of pairs begins.
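
As a rough sketch in C++ (names are illustrative, and I'm assuming
"first 29 bits" means the high-order bits, ratings 1-5, and customer
IDs that fit in 29 bits):

    #include <cstdint>
    #include <vector>

    // Pack a customer/rating pair into one 32-bit int: customer number
    // in the upper 29 bits, rating (1-5) in the lower 3 bits.
    inline uint32_t pack(uint32_t customer, uint32_t rating) {
        return (customer << 3) | (rating & 0x7u);
    }
    inline uint32_t customer_of(uint32_t p) { return p >> 3; }
    inline uint32_t rating_of(uint32_t p)   { return p & 0x7u; }

    // One enormous flat array of packed pairs, grouped by movie.
    // offsets[m] is where movie m's block starts and offsets[m+1] is
    // where it ends, so offsets has (number of movies + 1) entries.
    struct MovieIndex {
        std::vector<uint32_t> pairs;   // sorted by customer within each block
        std::vector<size_t>   offsets;
    };

Because the customer number sits in the upper bits, sorting the packed
ints within a block sorts them by customer number.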

You can look up a specific movie/customer combination by using the
index to go to the appropriate block of customer ratings for a given
movie, and then doing a binary search within that block.
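
Concretely, continuing the hypothetical MovieIndex sketch above:

    #include <algorithm>

    // Return the rating a customer gave a movie, or 0 if none.  Since
    // the rating occupies the low 3 bits and real ratings are >= 1,
    // searching for pack(customer, 0) lands at the right spot.
    uint32_t lookup(const MovieIndex& idx, size_t movie, uint32_t customer) {
        auto first = idx.pairs.begin() + idx.offsets[movie];
        auto last  = idx.pairs.begin() + idx.offsets[movie + 1];
        auto it = std::lower_bound(first, last, pack(customer, 0));
        if (it != last && customer_of(*it) == customer)
            return rating_of(*it);
        return 0;  // this customer never rated this movie
    }

Each lookup costs about log2(block size) comparisons.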

The other direction is a bit trickier to build an index for, because
there are many customers, sparsely distributed numerically.  But
again, the same principle holds.  You store movie/rating pairs as a
huge array of ints, grouped into blocks by customer number, in sorted
order.  Again, you want to build some sort of index that
allows you to take a customer number and find the starting address of
the corresponding block of movie/rating pairs.  I found it impractical
to represent the index directly as an array (depending on your memory,
you may be able to), but there are plenty of other schemes that would
work.  I think I used an array representation of a binary tree,
requiring log2(# of customers) steps to look up the starting address
of a given customer's block.
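
A simpler stand-in for the array-encoded tree, with the same
log2(# of customers) cost, is a sorted array of the distinct customer
IDs searched with std::lower_bound (again a sketch, names illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Customer-direction structure: packed movie/rating ints grouped
    // by customer, plus a sorted array of the distinct customer IDs
    // and a parallel array of block start offsets.
    struct CustomerIndex {
        std::vector<uint32_t> ids;     // sorted distinct customer IDs
        std::vector<size_t>   starts;  // starts[i]: offset of ids[i]'s block
        std::vector<uint32_t> pairs;   // packed movie/rating ints
    };

    // Return the [first, last) bounds of a customer's block of
    // movie/rating pairs, or an empty range for an unknown customer.
    std::pair<size_t, size_t>
    block_of(const CustomerIndex& idx, uint32_t customer) {
        auto it = std::lower_bound(idx.ids.begin(), idx.ids.end(), customer);
        if (it == idx.ids.end() || *it != customer)
            return {0, 0};                       // unknown customer
        size_t i = it - idx.ids.begin();
        size_t last = (i + 1 < idx.starts.size()) ? idx.starts[i + 1]
                                                  : idx.pairs.size();
        return {idx.starts[i], last};
    }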

Building this array and index was a bit of a pain, because the Netflix
dataset is distributed in a form organized by movie, not customer.  So
sorting this large amount of data by customer number is challenging if
your memory is limited.  An external mergesort, with runs sized to fit
in your memory, is probably the way to go.
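
A sketch of that external mergesort (file names, run size, and record
format are all illustrative, and error handling is omitted; here each
64-bit record carries the customer number in its upper 32 bits, so
sorting the raw ints sorts by customer):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Phase 1: read memory-sized chunks, sort each, write it as a run.
    std::vector<std::string> make_runs(const char* input, size_t chunk) {
        std::vector<std::string> runs;
        FILE* in = std::fopen(input, "rb");
        std::vector<uint64_t> buf(chunk);
        size_t n;
        while ((n = std::fread(buf.data(), sizeof(uint64_t), chunk, in)) > 0) {
            std::sort(buf.begin(), buf.begin() + n);
            std::string name = "run" + std::to_string(runs.size()) + ".bin";
            FILE* out = std::fopen(name.c_str(), "wb");
            std::fwrite(buf.data(), sizeof(uint64_t), n, out);
            std::fclose(out);
            runs.push_back(name);
        }
        std::fclose(in);
        return runs;
    }

    // Phase 2: merge all runs with a min-heap of (next value, run index).
    void merge_runs(const std::vector<std::string>& runs, const char* output) {
        using Item = std::pair<uint64_t, size_t>;
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        std::vector<FILE*> files;
        for (size_t i = 0; i < runs.size(); ++i) {
            files.push_back(std::fopen(runs[i].c_str(), "rb"));
            uint64_t v;
            if (std::fread(&v, sizeof v, 1, files[i]) == 1) heap.push({v, i});
        }
        FILE* out = std::fopen(output, "wb");
        while (!heap.empty()) {
            auto [v, i] = heap.top();   // smallest record across all runs
            heap.pop();
            std::fwrite(&v, sizeof v, 1, out);
            uint64_t next;
            if (std::fread(&next, sizeof next, 1, files[i]) == 1)
                heap.push({next, i});
        }
        std::fclose(out);
        for (FILE* f : files) std::fclose(f);
    }

After the merge, one last sequential pass can peel the customer number
off each record, filling in the index and keeping just the packed
movie/rating int.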

But once you've done all this, you've got two organizations of the
same data, totalling approx. 700MB, letting you look up all the
customer/rating pairs of a given movie, all the movie/rating pairs of
a given customer, or the rating for a specific movie/customer
combination, each in a very small number of steps.

--Mark

On Sun, Jun 29, 2008 at 11:04 PM, Yavuz Arkun <yarkun at gmail.com> wrote:
> On Mon, Jun 30, 2008 at 06:01, Mark Engelberg <mark.engelberg at gmail.com> wrote:
>> I played around a bit with the dataset when it first came out.
>>
>> My laptop is fairly slow -- 1 GB of RAM, a dual 1.6 GHz processor, and a
>> 5400 RPM hard drive.
>>
>> First, I tried a couple of database programs to manage the data, but
>> found that accessing data stored on the hard drive was far too slow,
>> and the only way I was going to get reasonable performance was by
>> storing the whole thing in memory.
>
> I, too, tried MySQL about a year ago. It was too slow.
>
>> Since my memory was limited, I
>> read the book "Managing Gigabytes" to get some ideas on how to index
>> and compress the data in a way that would fit in my memory footprint,
>> which I was eventually able to do.
>
> I will see if I can get hold of that book.
>
>>
>> I started out using Python, with homogeneous vectors to keep the memory
>> footprint low, but it was too slow to analyze all the data.  I
>> eventually rewrote everything in C++, which resulted in about a 100x
>> speedup.  This allowed my program to run in a couple of hours, as
>> opposed to a couple of weeks.  PLTScheme's performance is roughly on
>> par with Python's as far as I know, so my guess is that unless you
>> have a dramatically faster computer than I do, you'll find that
>> PLTScheme is too slow to process the Netflix dataset.
>
> Hmm, my PC is a bit faster (at best 2x) and has 2x memory. This is
> certainly worth keeping in mind, especially since I am more interested
> in exploring (= trying out lots of silly ideas) than in implementing
> something specific. One problem is that I know very little C, and I am
> clueless about C++. Oh well.
>
>>
>> It was fun to work on the Netflix contest, and I wish you the best of luck!
>>
>> --Mark
>
> Thanks :)
> --Yavuz

