[plt-scheme] PLT with very large datasets?

From: Yavuz Arkun (yarkun at gmail.com)
Date: Mon Jun 30 03:58:46 EDT 2008

On Mon, Jun 30, 2008 at 09:50, Mark Engelberg <mark.engelberg at gmail.com> wrote:

> [...] If I remember
> correctly, customer/rating pairs can be stored in a single int by
> storing the customer number in the first 29 bits, and the rating in
> the last 3 bits.

Since there are (I think; still re-downloading the dataset) about 500k
distinct user IDs, 20 bits should be enough if I recode them, as long
as I keep a record of the recoded-ID-to-Netflix-ID mapping so I can
generate the Netflix submission files. The ratings are more
problematic: the original ratings need just 3 bits, but if I have to
store calculated intermediate values for the predictions, those would
be real numbers that need more bits for sufficient precision.
According to Eli's post, there would be 10-11 bits available in a
fixnum for the ratings.
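
For concreteness, this is the kind of packing I have in mind in PLT
Scheme (just a rough sketch; the 20/3-bit split and all the names are
mine):

  ;; Pack a recoded customer ID (<= 20 bits) and a rating (3 bits)
  ;; into one small integer, and unpack them again.
  (define rating-bits 3)
  (define rating-mask (- (arithmetic-shift 1 rating-bits) 1)) ; #b111

  (define (pack-pair customer rating)
    (bitwise-ior (arithmetic-shift customer rating-bits) rating))

  (define (pair-customer packed)
    (arithmetic-shift packed (- rating-bits)))

  (define (pair-rating packed)
    (bitwise-and packed rating-mask))

  ;; e.g. customer 123456 rated some movie 5:
  ;; (pair-customer (pack-pair 123456 5)) => 123456
  ;; (pair-rating   (pack-pair 123456 5)) => 5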

> Alternatively, you can be slightly more efficient by creating one
> enormous array of ints, where all the customer/rating pair ints for
> movie 1 are stored sequentially (in sorted order), and then all the
> ints for movie 2, and so on and have the indexing array point to the
> address where each block of pairs begins for a given movie.
>
> You can look up a specific movie/customer combination by using the
> index to go to the appropriate block of customer ratings for a given
> movie, and then doing a binary search within that block.
>

Yes, that sounds like a reasonable plan.
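
Roughly, I imagine the lookup working like this (again just a sketch,
reusing the pack/unpack helpers above; movie-start would need one
extra sentinel entry past the last movie):

  ;; ratings     : vector of packed customer/rating ints, grouped by
  ;;               movie, each block sorted by customer number
  ;; movie-start : vector; (vector-ref movie-start m) is the index
  ;;               where movie m's block begins
  (define (lookup-rating ratings movie-start movie customer)
    (let loop ((lo (vector-ref movie-start movie))
               (hi (- (vector-ref movie-start (+ movie 1)) 1)))
      (if (> lo hi)
          #f ; this customer did not rate this movie
          (let* ((mid (quotient (+ lo hi) 2))
                 (c (pair-customer (vector-ref ratings mid))))
            (cond ((= c customer) (pair-rating (vector-ref ratings mid)))
                  ((< c customer) (loop (+ mid 1) hi))
                  (else (loop lo (- mid 1))))))))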

> The other direction is a bit trickier to build an index for, because
> there are many customers, sparsely distributed numerically.  But
> again, the same principle holds.  You store movie/rating pairs as a
> huge array of ints, grouped in blocks according to the customer number
> in sorted order.   Again, you want to build some sort of index that
> allows you to take a customer number and find the starting address of
> the corresponding block of movie/rating pairs.  I found it impractical
> to represent the index directly as an array (depending on your memory,
> you may be able to), but there are plenty of other schemes that would
> work.  I think I did an array-representation of a binary tree,
> requiring log2(# of customers) steps to look up the starting address
> for the block of customers.

All this indexing is where the task becomes close to what a db program
does. I had hoped to bypass it by using MySQL, but as you said,
relational databases are too slow for this task (probably because
their algorithms are optimized for on-disk storage).
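
If a plain array over the sparse customer numbers turns out to be too
big, I guess a binary search over a sorted vector of customer IDs with
a parallel vector of block offsets would give the same log2(# of
customers) behaviour as your array-coded tree; something like this
(all names are my own placeholders, not your actual scheme):

  ;; cust-ids   : sorted vector of the distinct customer IDs
  ;; cust-start : parallel vector; entry i is where customer
  ;;              (vector-ref cust-ids i)'s block of movie/rating
  ;;              pairs begins in the customer-sorted array
  (define (customer-block-start cust-ids cust-start customer)
    (let loop ((lo 0) (hi (- (vector-length cust-ids) 1)))
      (if (> lo hi)
          #f ; unknown customer
          (let ((mid (quotient (+ lo hi) 2)))
            (cond ((= (vector-ref cust-ids mid) customer)
                   (vector-ref cust-start mid))
                  ((< (vector-ref cust-ids mid) customer)
                   (loop (+ mid 1) hi))
                  (else (loop lo (- mid 1))))))))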

>
> Building this array and index was a bit of a pain, because the netflix
> dataset is distributed in a form organized by movie, not customer.  So
> sorting this large amount of data by customer number is challenging if
> your memory is limited.  Mergesort, with blocks sized appropriately
> for your memory, is probably the way to go.

Time to read up on it then ;)
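
As far as I understand it, the idea is to sort memory-sized chunks,
write each sorted run to disk, and then merge the runs; the merge step
itself would be something like this (sketch only, reusing the
pair-customer helper from above):

  ;; Merge two runs that are already sorted by customer number.
  ;; In the external sort, each run is a sorted chunk read from disk.
  (define (merge-runs a b)
    (cond ((null? a) b)
          ((null? b) a)
          ((<= (pair-customer (car a)) (pair-customer (car b)))
           (cons (car a) (merge-runs (cdr a) b)))
          (else
           (cons (car b) (merge-runs a (cdr b))))))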

> But once you've done all this, you've got two organizations of the
> same data, totalling approx. 700MB, allowing you to very rapidly look
> up all customer/rating pairs of a given movie, or all movie/rating
> pairs for a given customer, or a rating for a given movie/customer
> combination in a very small number of steps.

Yup, this is about as far as I got with MySQL last time: I generated
two data tables, one sorted and indexed by movie ID, the other by user
ID. I got stuck there, though, because even calculating simple
statistics such as averages by movie or by user took days.
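
With the movie-sorted array in place, an average per movie should just
be a scan over one block, e.g. (sketch, same placeholder names as
above):

  ;; Mean rating for one movie, scanning its block in the
  ;; movie-sorted array of packed customer/rating ints.
  (define (movie-average ratings movie-start movie)
    (let ((start (vector-ref movie-start movie))
          (end   (vector-ref movie-start (+ movie 1))))
      (if (= start end)
          #f ; no ratings for this movie
          (let loop ((i start) (sum 0))
            (if (= i end)
                (exact->inexact (/ sum (- end start)))
                (loop (+ i 1)
                      (+ sum (pair-rating (vector-ref ratings i)))))))))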

Thanks again for the hints :)
--Yavuz


Posted on the users mailing list.