[plt-scheme] Re: Some sort of documentation tool from the toplevel?

From: Matt Jadud (mcj4 at kent.ac.uk)
Date: Thu Jul 7 02:05:11 EDT 2005

> I guess I'm looking for feedback on what I can do to make this actually
> work; one of my main problem now seems to be robust HTML scraping issues.
> 
> At the moment, I'm just using the html and xml collections provided as
> part of PLT Scheme to extract content from the HTML-ified reference
> documentation, but the parsers seem particularly unhappy about the
> non-well-formedness in there.  Should I be using a different set of
> parsers?

You have several choices for your HTML scraping.

1. WebIt's 'xml-match'
http://celtic.benderweb.net/webit/

Jim Bender's WebIt framework gives you some nifty pattern matching tools 
for working with XML expressions.

For example,

(xml-match xexpr
   [(strong ,text)
    (printf "Found this text: ~a~n" text)])

is a snippet that matches values of xexpr like

(strong "Hi there!")

but not

(em (strong "Hi there!"))

So the library is "fragile" in that a pattern only matches at the node 
you hand it, and you have to handle your own recursion through the tree. 
Given that the docs may change their HTML format, this is probably a 
relatively tedious approach, since your patterns depend on the document 
structure staying fixed.
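
To make "handle your own recursion" concrete, here is a minimal sketch in 
plain Scheme, with no WebIt involved. The name find-elements is made up 
for this example, and it assumes x-expressions where an attribute list, 
if present, looks like ((href "x")).

```scheme
;; Hypothetical helper, not part of WebIt: collect every element named
;; `tag` from an x-expression, at any depth, in document order.
(define (find-elements tag xexpr)
  (cond
    ;; Strings, symbols, '() and attribute lists such as ((href "x"))
    ;; are not elements, so they contribute nothing.
    [(not (and (pair? xexpr) (symbol? (car xexpr)))) '()]
    [else
     ;; Search every child first, then decide whether this node matches.
     (let ([nested (apply append
                          (map (lambda (child) (find-elements tag child))
                               (cdr xexpr)))])
       (if (eq? (car xexpr) tag)
           (cons xexpr nested)
           nested))]))

;; Finds the nested element that the xml-match pattern above skips.
(find-elements 'strong '(em (strong "Hi there!")))
```

Each element this returns could then be handed to xml-match, so the 
patterns themselves stay small.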

2. XML query languages
A more robust approach might be to use an XML query language. This 
approach lets you write (pardon the generalization) "SQL-like" queries 
over an (S)XML document. This way, you can say something like "give me 
all the nodes that are wrapped in the <strong> tag."

Jim Bender's library (above) contains an XML query language of this nature:

http://celtic.benderweb.net/webit/docs/xquery-pre/

and there is the SXPath library provided by Oleg Kiselyov's SSAX/SXML 
implementation:

http://okmij.org/ftp/Scheme/xml.html

(and you may get lost in, or enjoy, the other things found at the root of 
that site, http://okmij.org/ftp/).
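
As a rough sketch of the "give me all the <strong> nodes" query in 
SXPath: this assumes sxpath from the SSAX/SXML distribution is already 
loaded (the require or load line depends on how you installed it, so I 
have left it out), and the sample document is made up.

```scheme
;; ASSUMPTION: `sxpath` from the SSAX/SXML distribution is already loaded.
;; Made-up sample document in SXML form.
(define doc
  '(*TOP* (html (body (p "Say " (em (strong "Hi there!")))))))

;; `//` selects descendants at any depth, so the nesting inside
;; <p> and <em> does not matter to the query.
(define strong-nodes ((sxpath '(// strong)) doc))
```

The query never mentions html, body, p, or em, so a change to the 
surrounding markup should not break it.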

3. HTMLPrag
All of that said, though, perhaps you should look at Neil Van Dyke's 
HtmlPrag, which parses HTML permissively, and at his 
web-scraper-helper-thinger (WebScraperHelper).

http://www.neilvandyke.org/htmlprag/
http://planet.plt-scheme.org/#webscraperhelper.plt1.1

WebScraperHelper might take some of the difficulty out of crafting the 
SXPath queries to pull a page apart.

All of these approaches are going to stretch your knowledge of Scheme in 
one way or another, and there certainly may be other ways to go about 
(permissively?) parsing HTML/XML in PLT Scheme. These were the three 
that came to mind, and my apologies if I've left anything out or 
mis-attributed any work.

Good luck!
Matt



Posted on the users mailing list.