[plt-scheme] RSS Parser Help!

From: Brent Fulgham (bfulg at pacbell.net)
Date: Thu Jan 10 18:43:19 EST 2008

I've been playing with some of the exercises in "Programming Collective Intelligence" in DrScheme.  I started by trying to extend Untyped's Del.icio.us API implementation to use some of the RSS-based commands available in the Python implementation, but I'm having some trouble parsing the returned RSS.

Unfortunately, I can't seem to get the RSS returned in a form that is amenable to handling via the wonderful webit-style XML matching.

Is anyone aware of a working RSS parser for Scheme?  Everything I've found via a quick search is concerned with *building* an RSS feed.  I want to consume one and extract certain elements from it.

If I use the HtmlPrag package's html->sxml, I don't get SXML that can be easily parsed.  See the test case and the sample input and output below (sorry for the length of the post -- please skip down, as I include some more text at the very bottom):

=== Test Case =======================================
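(module pltdelicious mzscheme
  (require (planet "htmlprag.ss" ("neil" "htmlprag.plt"))
           (lib "url.ss" "net")
           (lib "string.ss")
           (lib "match.ss")
           (prefix srfi_ (lib "1.ss" "srfi")))

  (provide url->sxml)

  ;; Fetch the document at `url` and parse it with HtmlPrag's html->sxml.
  (define (url->sxml url)
    (let ((rss (html->sxml (get-pure-port (string->url url)))))
      ;; Drop string elements.  Note that filter only walks the top-level
      ;; list, so strings nested inside child elements are kept.
      (srfi_filter (lambda (s) (not (string? s))) rss)))
)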
=================================================

Attempting to use the above snippet generates a bunch of SXML that is not acceptable to Webit's parser:

> (require pltdelicious)
> (url->sxml "http://del.icio.us/rss/")

This turns the following feed:

======= Received RSS  ==============================
<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:content="http://purl.org/rss/1.0/modules/content/"
 xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
 xmlns:admin="http://webns.net/mvcb/"
>

<channel rdf:about="http://del.icio.us/">
<title>del.icio.us hotlist</title>
<link>http://del.icio.us/</link>
<description></description>
<items>
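 <rdf:Seq>
  <rdf:li rdf:resource="http://www.doodle.ch/main.html" />
  <rdf:li rdf:resource="http://google-mania.net/archives/858" />
  [ ... etc ... ]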
 </rdf:Seq>
</items>
</channel>

=================================================


into this:

======= Example SXML Output =============================
(*TOP*
 (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
 (rdf
  (@
   (xmlns:rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
   (xmlns "http://purl.org/rss/1.0/")
   (xmlns:content "http://purl.org/rss/1.0/modules/content/")
   (xmlns:taxo "http://purl.org/rss/1.0/modules/taxonomy/")
   (xmlns:dc "http://purl.org/dc/elements/1.1/")
   (xmlns:syn "http://purl.org/rss/1.0/modules/syndication/")
   (xmlns:admin "http://webns.net/mvcb/"))
  "\n"
  "\n"
  (channel
    (@ (about "http://del.icio.us/"))
    "\n"
    (title "del.icio.us hotlist")
    "\n"
    (link)
    "http://del.icio.us/"
    "\n"
    (description)
    "\n"
    (items
     "\n"
     " "
     (seq
      "\n"
      "  "
      (li (@ (resource "http://www.doodle.ch/main.html")))
      "\n"
      "  "
      (li (@ (resource "http://google-mania.net/archives/858")))
      "\n"
      "  "
 [ ... etc ... ]
=================================================


If I attempt to pass this through the XML parser, it complains about the "(*PI*" term, and it will later choke on the "\n" linefeed strings, and so on.
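
The kind of cleanup this seems to call for could be sketched as a recursive pass that drops whitespace-only strings and *PI* nodes at every level of the tree, not just the top one.  What follows is only a minimal sketch in plain Scheme: the names blank-string?, pi-node? and strip-sxml are hypothetical (they come from neither HtmlPrag nor Webit), it assumes the SXML is made of proper lists as in the output above, and whether the result is then acceptable to Webit is untested.

```scheme
;; Hypothetical helpers, not part of HtmlPrag or Webit.

;; True when s is a string made up solely of whitespace characters.
(define (blank-string? s)
  (and (string? s)
       (let loop ((chars (string->list s)))
         (or (null? chars)
             (and (char-whitespace? (car chars))
                  (loop (cdr chars)))))))

;; True for a processing-instruction node such as (*PI* xml "...").
(define (pi-node? node)
  (and (pair? node) (eq? (car node) '*PI*)))

;; Rebuild the tree without blank strings and *PI* nodes.
;; Attribute lists (@ ...) are returned unchanged.
(define (strip-sxml node)
  (cond
    ((not (pair? node)) node)
    ((eq? (car node) '@) node)
    (else
     (let loop ((kids node) (acc '()))
       (cond
         ((null? kids) (reverse acc))
         ((or (blank-string? (car kids)) (pi-node? (car kids)))
          (loop (cdr kids) acc))
         (else
          (loop (cdr kids) (cons (strip-sxml (car kids)) acc))))))))

;; Example on a small hand-written tree:
(strip-sxml
 '(*TOP* (*PI* xml "version=\"1.0\"")
         (channel "\n" (title "del.icio.us hotlist") "\n  ")))
;; => (*TOP* (channel (title "del.icio.us hotlist")))
```

Unlike the top-level filter in the test case, this keeps non-blank strings such as the channel title, so text content survives the pass.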

Can someone suggest a proper approach?

Thanks,

-Brent

Posted on the users mailing list.