[plt-scheme] RSS Parser Help!
I've been playing with some of the exercises in "Programming Collective Intelligence" in DrScheme. I started by trying to extend Untyped's Del.icio.us API implementation to use some of the RSS-based commands available in the Python implementation, but I'm having some trouble parsing the returned RSS.
Unfortunately, I can't seem to get the RSS returned in a form that is amenable to handling via the wonderful webit-style XML matching.
Is anyone aware of a working RSS parser for Scheme? Everything I've found via a quick search is involved in *building* an RSS feed. I wish to gobble one up and extract certain elements from it.
If I use the HtmlPrag package's html->sxml, the SXML I get back is not easy to parse. The test case below is followed by sample input and output (sorry for the length of the post -- I include some more text at the very bottom, so please skip down):
=== Test Case =======================================
(module pltdelicious mzscheme
  (require (planet "htmlprag.ss" ("neil" "htmlprag.plt"))
           (lib "url.ss" "net")
           (lib "string.ss")
           (lib "match.ss")
           (prefix srfi_ (lib "1.ss" "srfi")))
  (provide url->sxml)
  (define (url->sxml url)
    (let ((rss (html->sxml (get-pure-port (string->url url)))))
      (srfi_filter (lambda (s) (not (string? s))) rss)))
  )
=================================================
Attempting to use the above snippet generates a bunch of SXML that is not acceptable to Webit's parser:
> (require pltdelicious)
> (url->sxml "http://del.icio.us/rss/")
This turns the following:
======= Received RSS ==============================
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
xmlns:admin="http://webns.net/mvcb/"
>
<channel rdf:about="http://del.icio.us/">
<title>del.icio.us hotlist</title>
<link>http://del.icio.us/</link>
<description></description>
<items>
<rdf:Seq>
</rdf:Seq>
</items>
</channel>
=================================================
into this:
======= Example SXML Output =============================
(*TOP*
 (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
 (rdf
  (@
   (xmlns:rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
   (xmlns "http://purl.org/rss/1.0/")
   (xmlns:content "http://purl.org/rss/1.0/modules/content/")
   (xmlns:taxo "http://purl.org/rss/1.0/modules/taxonomy/")
   (xmlns:dc "http://purl.org/dc/elements/1.1/")
   (xmlns:syn "http://purl.org/rss/1.0/modules/syndication/")
   (xmlns:admin "http://webns.net/mvcb/"))
  "\n"
  "\n"
  (channel
   (@ (about "http://del.icio.us/"))
   "\n"
   (title "del.icio.us hotlist")
   "\n"
   (link)
   "http://del.icio.us/"
   "\n"
   (description)
   "\n"
   (items
    "\n"
    " "
    (seq
     "\n"
     " "
     (li (@ (resource "http://www.doodle.ch/main.html")))
     "\n"
     " "
     (li (@ (resource "http://google-mania.net/archives/858")))
     "\n"
     " "
     [ ... etc ... ]
=================================================
If I attempt to pass this through the XML parser, it complains about the "(*PI*" term, and later it will choke on the linefeed elements, etc.
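A minimal sketch of one possible cleanup pass, assuming the only obstacles are the *PI* node and the whitespace-only strings. The names clean-sxml, pi-node? and whitespace-string? are hypothetical helpers, not part of HtmlPrag or WebIt!, and whether the cleaned tree is then accepted by the matcher is untested:

```scheme
;; Sketch only: these helpers are hypothetical, not HtmlPrag or WebIt! API.

;; #t if s is a string that is empty or contains only whitespace.
(define (whitespace-string? s)
  (and (string? s)
       (let loop ((i 0))
         (or (= i (string-length s))
             (and (char-whitespace? (string-ref s i))
                  (loop (+ i 1)))))))

;; #t if node is a processing-instruction node such as (*PI* xml "...").
(define (pi-node? node)
  (and (pair? node) (eq? (car node) '*PI*)))

;; Recursively drop *PI* nodes and whitespace-only strings from an
;; SXML tree. Attribute lists (@ ...) are returned unchanged.
(define (clean-sxml node)
  (cond ((not (pair? node)) node)
        ((eq? (car node) '@) node)
        (else
         (let loop ((kids (cdr node)) (acc '()))
           (cond ((null? kids) (cons (car node) (reverse acc)))
                 ((or (pi-node? (car kids))
                      (whitespace-string? (car kids)))
                  (loop (cdr kids) acc))
                 (else
                  (loop (cdr kids)
                        (cons (clean-sxml (car kids)) acc))))))))
```

For example, (clean-sxml '(*TOP* (*PI* xml "v") (rdf (@ (a "b")) "\n" (title "x") "\n" " "))) evaluates to (*TOP* (rdf (@ (a "b")) (title "x"))). Unlike the top-level srfi_filter in the test case, this walks the whole tree. It does not repair the (link) element, which still comes out empty with its URL as a sibling string, and it would also drop element text that is pure whitespace.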
Can someone suggest a proper approach?
Thanks,
-Brent