[plt-scheme] RSS Parser Help!

From: Brent Fulgham (bfulg at pacbell.net)
Date: Thu Jan 10 18:43:19 EST 2008

I've been playing with some of the exercises in "Programming Collective Intelligence" in DrScheme.  I started by trying to extend Untyped's Del.icio.us API implementation to use some of the RSS-based commands available in the Python implementation, but I'm having some trouble parsing the returned RSS.

Unfortunately, I can't seem to get the returned RSS into a form that the wonderful Webit-style XML matching can handle.

Is anyone aware of a working RSS parser for Scheme?  Everything I've found via a quick search is about *building* an RSS feed.  I want to gobble one up and extract certain elements from it.

If I use the HtmlPrag package's html->sxml, I don't get SXML that can be easily parsed.  See the test case and the input/output samples below (sorry for the length of the post -- please skip down, as there is more text at the very bottom):

=== Test Case =======================================
(module pltdelicious mzscheme
  (require (planet "htmlprag.ss" ("neil" "htmlprag.plt"))
           (lib "url.ss" "net")
           (lib "string.ss")
           (lib "match.ss")
           (prefix srfi_ (lib "1.ss" "srfi")))
  (provide url->sxml)

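  ;; Fetch the feed at url, parse it with HtmlPrag, and drop the
  ;; top-level strings from the resulting SXML.
  (define (url->sxml url)
    (let ((rss (html->sxml (get-pure-port (string->url url)))))
      (srfi_filter (lambda (s) (not (string? s))) rss))))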

Attempting to use the above snippet generates a bunch of SXML that is not acceptable to Webit's parser:

> (require pltdelicious)
> (url->sxml "http://del.icio.us/rss/")

The call turns this:

======= Received RSS  ==============================
<?xml version="1.0" encoding="UTF-8"?>


<title>del.icio.us hotlist</title>


into this:

======= Example SXML Output =============================
 (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
   (xmlns:rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
   (xmlns "http://purl.org/rss/1.0/")
   (xmlns:content "http://purl.org/rss/1.0/modules/content/")
   (xmlns:taxo "http://purl.org/rss/1.0/modules/taxonomy/")
   (xmlns:dc "http://purl.org/dc/elements/1.1/")
   (xmlns:syn "http://purl.org/rss/1.0/modules/syndication/")
   (xmlns:admin "http://webns.net/mvcb/"))
    (@ (about "http://del.icio.us/"))
    (title "del.icio.us hotlist")
     " "
      "  "
      (li (@ (resource "http://www.doodle.ch/main.html")))
      "  "
      (li (@ (resource "http://google-mania.net/archives/858")))
      "  "
 [ ... etc ... ]

If I attempt to pass this through the XML parser, it complains about the "(*PI*" term, and later it will choke on the linefeed elements (the " " strings), etc.
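To make the two complaints concrete, here is a minimal sketch of a cleanup pass that walks the SXML recursively and drops both the *PI* nodes and the whitespace-only strings. The names strip-sxml-noise, pi-node? and whitespace-string? are hypothetical helpers, not part of HtmlPrag or Webit, and whether Webit accepts the cleaned tree is an assumption that has not been checked.

```scheme
;; Sketch only: these helpers are hypothetical, not HtmlPrag or Webit API.

;; True when s is a string containing nothing but whitespace
;; (the empty string counts as whitespace-only here).
(define (whitespace-string? s)
  (and (string? s)
       (let loop ((i 0))
         (or (= i (string-length s))
             (and (char-whitespace? (string-ref s i))
                  (loop (+ i 1)))))))

;; True for a processing-instruction node such as (*PI* xml "...").
(define (pi-node? x)
  (and (pair? x) (eq? (car x) '*PI*)))

;; Rebuild the tree, dropping *PI* nodes and whitespace-only strings
;; at every nesting level. Anything that is not a list is returned
;; unchanged. Note that a whitespace-only attribute value would also
;; be dropped, which may or may not be wanted.
(define (strip-sxml-noise node)
  (if (list? node)
      (let loop ((kids node) (acc '()))
        (cond ((null? kids) (reverse acc))
              ((or (pi-node? (car kids))
                   (whitespace-string? (car kids)))
               (loop (cdr kids) acc))
              (else
               (loop (cdr kids)
                     (cons (strip-sxml-noise (car kids)) acc)))))
      node))

;; Example on a fragment shaped like the output above:
(strip-sxml-noise
 '((*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
   (title "del.icio.us hotlist")
   " "
   "  "
   (li (@ (resource "http://www.doodle.ch/main.html")))))
;; => ((title "del.icio.us hotlist")
;;     (li (@ (resource "http://www.doodle.ch/main.html"))))
```

Unlike the srfi_filter call in the test case, which only removes strings at the top level (and removes all of them, not just blank ones), this descends into nested elements and keeps real text content.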

Can someone suggest a proper approach?



Posted on the users mailing list.