[plt-scheme] PLT with very large datasets?

From: Eli Barzilay (eli at barzilay.org)
Date: Sun Jun 29 18:11:36 EDT 2008

Yavuz Arkun wrote:
>
> If I cannot work in RAM, is there a way to memory map the data
> efficiently, using standard facilities of PLT Scheme?

Sounds like working in RAM is not a good idea -- all you need is for
Netflix to stay around a little longer and get some more customers,
and you end up with twice the space and more.  Memory-mapping might
be difficult to do, and I don't have any experience with that -- but
I'm not sure it would be the right thing either.  Depending on your
algorithm, you might want better control over what stays in memory
for a particular chunk of code.

So I think that your best bet is to do the usual work: read the data
and save it in an indexed file, then create a wrapper that handles
the reading and caching automatically and efficiently, so you can
treat values from the file as if they were in memory.  It sounds like
you only need to read entries -- and if they're fixed length, you
don't even need an index, since an entry's offset is just its
position times the record size.
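
For example, something along these lines (a rough, untested sketch --
the 9-byte record layout of a 4-byte movie id, a 4-byte user id, and
a 1-byte rating, and the trivial one-entry cache, are only
assumptions for illustration):

  #lang scheme
  ;; fixed-length 9-byte records, read back by record index

  (define record-size 9)

  (define (write-triplet out movie user rating)
    (write-bytes (integer->integer-bytes movie 4 #f #t) out)
    (write-bytes (integer->integer-bytes user  4 #f #t) out)
    (write-byte rating out))

  (define (read-triplet in i)
    (file-position in (* i record-size))
    (let ([buf (read-bytes record-size in)])
      (values (integer-bytes->integer buf #f #t 0 4)  ; movie id
              (integer-bytes->integer buf #f #t 4 8)  ; user id
              (bytes-ref buf 8))))                    ; rating

  ;; remember the last record read, so repeated lookups are cheap
  (define last-i #f)
  (define last-v #f)
  (define (cached-read-triplet in i)
    (unless (equal? i last-i)
      (set! last-i i)
      (set! last-v (call-with-values (lambda () (read-triplet in i))
                                     list)))
    (apply values last-v))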


On Jun 29, Chongkai Zhu wrote:
> My 2 cents:
> 
> 1. PLT's GC starts to work when you use half of the memory.  So if
> your actual data consumes 1G, you need at least 2G of memory.
> 
> 2. You said "in theory, about 9 bytes per triplet", but Scheme is a
> dynamically typed language, which means some memory is used for
> type tags.

One way to cut the memory down to the minimum needed is to use a
homogeneous numeric vector (either through srfi-4 or the foreign
interface).  This means that only the numbers are stored, and some
wrapper functions can make it look like a vector of triplets.
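
Something like this untested sketch, where the ids go into
u32vectors and the ratings into a u8vector (the names and layout are
made up for illustration):

  #lang scheme
  (require srfi/4)  ; homogeneous numeric vectors

  (define-struct triplets (movies users ratings))

  (define (new-triplets n)
    (make-triplets (make-u32vector n)
                   (make-u32vector n)
                   (make-u8vector n)))

  (define (triplets-set! ts i movie user rating)
    (u32vector-set! (triplets-movies  ts) i movie)
    (u32vector-set! (triplets-users   ts) i user)
    (u8vector-set!  (triplets-ratings ts) i rating))

  (define (triplets-ref ts i)
    (values (u32vector-ref (triplets-movies  ts) i)
            (u32vector-ref (triplets-users   ts) i)
            (u8vector-ref  (triplets-ratings ts) i)))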

Yet another option, similar to this, is to just use one big byte
string holding the file data, and use `floating-point-bytes->real'
to read numbers from specific positions.
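
Roughly like this (untested; the file name and the assumption of
8-byte big-endian floats are only for illustration):

  #lang scheme
  ;; slurp the whole file into one byte string, then decode a float
  ;; at a given index
  (define data
    (call-with-input-file "ratings.dat"
      (lambda (in) (read-bytes (file-size "ratings.dat") in))))

  (define (float-ref i)
    (floating-point-bytes->real data #t (* i 8) (* (add1 i) 8)))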

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!

