[plt-scheme] Lazy streams and memory in sxml's lazy:xml->sxml / lazy:sxpath?
> i think the reasons for the growing memory consumption are explained in
> http://srfi.schemers.org/srfi-45/srfi-45.html
>
> it looks as though SICP and srfi-40 only scratch the surface of lazyness
> ;-)
Hi Evangelos,
Ok, sure.
>From my understanding so far, lazy:xml->sxml isn't yet an appropriate
function to do a memory-bounded, pull-style traversal through XML.
The problem I have is that most of the XML files I play with as a hobby
(now that I don't have a job *grin*) are bioinformatics datasets. The
major characteristic of these files is that they are too huge to hold in
memory at once, so the bounded memory requirement is a solid constraint I
can't ignore.
I wanted to make sure I wasn't running into a weird issue with
lazy:sxpath by doing an explicit traversal through the lazy structure,
using the structure provided by lazy:xml->sxml alone:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(module test-lazy-parsing mzscheme
(require (planet "sxml.ss" ("lizorkin" "sxml.plt" 1 4))
(lib "plt-match.ss")
(lib "pretty.ss")
(lib "list.ss"))
;; print-fragments: -> void
;; Displays all the sxml fragments for each GO term.
(define (print-fragments)
;; Note: go_daily-termdb.rdf-xml can be found at:
;; http://archive.godatabase.org/latest-termdb/
(let* ([ip (open-input-file "~/go_daily-termdb.rdf-xml")]
[lazy-doc
(lazy:xml->sxml
ip
'([go . "http://www.geneontology.org/dtds/go.dtd#"]
[rdf . "http://www.w3.org/1999/02/22-rdf-syntax-ns#"]))]
[lazy-terms (handle-go-children
(handle-top-children
(handle-top lazy-doc)))])
(fold-lazy-list (lambda (sxml acc)
(pretty-print sxml)
(void))
(void)
lazy-terms)))
;; handle-top: sxml -> (listof lazy-sxml)
(define (handle-top sxml)
(match sxml
[(list '*TOP* top-children ...)
top-children]))
;; handle-top-children: (listof lazy-sxml) -> (listof lazy-sxml)
(define (handle-top-children sxml)
(match sxml
[(list (list 'go:go go-children ...))
go-children]
[(list other-elt rest ...)
(handle-top-children rest)]))
;; handle-go-children: (listof lazy-sxml) -> (listof lazy-sxml)
(define (handle-go-children sxml)
(match sxml
[(list (list 'rdf:RDF lazy-go-terms ...) _ ...)
lazy-go-terms]))
;; fold-lazy-list: (X Y -> Y) Y (listof lazy-sxml) -> Y
;; Given the particular kind of lazy-list given by sxml's
;; lazy:xml->sxml, does a fold across its structure.
(define (fold-lazy-list f acc lazy-list)
(cond [(empty? lazy-list)
acc]
[else
(fold-lazy-list f
(f (first lazy-list) acc)
(force (second lazy-list)))]))
(print-fragments))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Through both mzscheme and mzscheme3m, I see the same unbounded memory
problem I see in my other code example, so the problem seems intrinsic in
the data structure returned by lazy:xml->sxml. I did take a closer look
at the implementation of lazy:xml->sxml, and it's full of continuation
magic that makes this a bit nontrivial to understand.
But I need to analyze what it's doing. I don't yet see why there can't be
a "right" way to fix this that preserves the laziness with the
bounded-memory constraint.
Thanks again for the SRFI references; I'll read them more thoroughly to
understand the basic problem better. Best of wishes!