[racket] Understanding GC when working with streams

From: Lawrence Woodman (lwoodman at vlifesystems.com)
Date: Sat Sep 7 03:52:48 EDT 2013

Hello,

I'm trying to understand how memory is allocated and collected when
working with streams. I recently asked a question on Stack Overflow
about how to limit memory use when using streams and got two good answers:
http://stackoverflow.com/questions/18629188/how-to-limit-memory-use-when-using-a-stream

However, I'm seeking a better understanding than the SO format can
really give. I want to use streams because I have too much data to fit
in memory, so I want to bring the data in from files and databases
sequentially, as needed. However, I'm finding that the GC is not
collecting as I had hoped, so streams are not quite as straightforward
a solution as I expected. The following code demonstrates the sort of
problems I am experiencing:

   #lang racket
   (require rackunit)

   ; This program fails with out-of-memory errors when the memory limit is
   ; set to 128 MB.
   ; It always fails when it comes to testing filtered-nums, regardless of
   ; how test-nums? and test-gen-filtered-nums? have been set.  However,
   ; test-for/sum-gen-filtered-nums? also fails if set.

   (define max-num 10000000)
   (define test-nums? #f)
   (define test-gen-filtered-nums? #f)
   (define test-for/sum-gen-filtered-nums? #f)

   (define nums (in-range max-num))
   (define filtered-nums
     (stream-filter (λ (i) (values #t)) nums))

   (define (gen-filtered-nums)
     (stream-filter (λ (i) (values #t)) nums))

   (when test-nums?
     (displayln "Testing nums")
     (check-equal? max-num (stream-length nums)))

   (when test-gen-filtered-nums?
     (displayln "Testing gen-filtered-nums")
     (check-equal? max-num (stream-length (gen-filtered-nums))))

   (when test-for/sum-gen-filtered-nums?
     (displayln "Testing with for/sum-gen-filtered-nums ")
     (check-equal? max-num (for/sum ([i (gen-filtered-nums)]) 1)))


   (displayln "Testing filtered-nums")
   (check-equal? max-num (stream-length filtered-nums))


I understand that making multiple passes through a big data set is
inefficient, but here I am trying to gain a better understanding of the GC.
So this leads me to a few related questions:

   i.   Why does the GC seem to collect more effectively when the stream is
        created in a function as opposed to in a straight definition? I.e.,
        test-gen-filtered-nums? passes, although I note that
        test-for/sum-gen-filtered-nums? doesn't.
   ii.  Is stream-filter inappropriate to use with big data sets?
   iii. Is there a better choice than streams for dealing with big data
        sets coming from disparate sources such as files, databases, etc.,
        within Racket?


Thanks



Lorry


-- 
vLife Systems Ltd
Registered Office: The Meridian, 4 Copthall House, Station Square, Coventry, CV1 2FL
Registered in England and Wales No. 06477649
http://vlifesystems.com

Posted on the users mailing list.