[plt-scheme] Lazy streams and memory in sxml's lazy:xml->sxml / lazy:sxpath?

From: Danny Yoo (dyoo at hkn.eecs.berkeley.edu)
Date: Mon May 8 19:09:54 EDT 2006

> i think the reasons for the growing memory consumption are explained in 
> http://srfi.schemers.org/srfi-45/srfi-45.html
> it looks as though SICP and srfi-40 only scratch the surface of lazyness 
> ;-)

Hi Evangelos,

Ok, sure.

>From my understanding so far, lazy:xml->sxml isn't yet an appropriate 
function to do a memory-bounded, pull-style traversal through XML.

The problem I have is that most of the XML files I play with as a hobby 
(now that I don't have a job *grin*) are bioinformatics datasets.  The 
major characteristic of these files is that they are too huge to hold in 
memory at once, so the bounded memory requirement is a solid constraint I 
can't ignore.

I wanted to make sure I wasn't running into a weird issue with 
lazy:sxpath by doing an explicit traversal through the lazy structure, 
using the structure provided by lazy:xml->sxml alone:

(module test-lazy-parsing mzscheme
   (require (planet "sxml.ss" ("lizorkin" "sxml.plt" 1 4))
            (lib "plt-match.ss")
            (lib "pretty.ss")
            (lib "list.ss"))

   ;; print-fragments: -> void
   ;; Displays all the sxml fragments for each GO term.
   (define (print-fragments)
     ;; Note: go_daily-termdb.rdf-xml can be found at:
     ;; http://archive.godatabase.org/latest-termdb/
     (let* ([ip (open-input-file "~/go_daily-termdb.rdf-xml")]
               '([go . "http://www.geneontology.org/dtds/go.dtd#"]
                 [rdf . "http://www.w3.org/1999/02/22-rdf-syntax-ns#"]))]
            [lazy-terms (handle-go-children
                          (handle-top lazy-doc)))])
       (fold-lazy-list (lambda (sxml acc)
                         (pretty-print sxml)

   ;; handle-top: sxml -> (listof lazy-sxml)
   (define (handle-top sxml)
     (match sxml
       [(list '*TOP* top-children ...)

   ;; handle-top-children: (listof lazy-sxml) -> (listof lazy-sxml)
   (define (handle-top-children sxml)
     (match sxml
       [(list (list 'go:go go-children ...))
       [(list other-elt rest ...)
        (handle-top-children rest)]))

   ;; handle-go-children: (listof lazy-sxml) -> (listof lazy-sxml)
   (define (handle-go-children sxml)
     (match sxml
       [(list (list 'rdf:RDF lazy-go-terms ...) _ ...)

   ;; fold-lazy-list: (X Y -> Y) Y (listof lazy-sxml) -> Y
   ;; Given the particular kind of lazy-list given by sxml's
   ;; lazy:xml->sxml, does a fold across its structure.
   (define (fold-lazy-list f acc lazy-list)
     (cond [(empty? lazy-list)
            (fold-lazy-list f
                            (f (first lazy-list) acc)
                            (force (second lazy-list)))]))


Through both mzscheme and mzscheme3m, I see the same unbounded memory 
problem I see in my other code example, so the problem seems intrinsic in 
the data structure returned by lazy:xml->sxml.  I did take a closer look 
at the implementation of lazy:xml->sxml, and it's full of continuation 
magic that makes this a bit nontrivial to understand.

But I need to analyze what it's doing.  I don't yet see why there can't be 
a "right" way to fix this that preserves the laziness with the 
bounded-memory constraint.

Thanks again for the SRFI references; I'll read them more thoroughly to 
understand the basic problem better.  Best of wishes!

Posted on the users mailing list.