[plt-scheme] Parsing html : match or regexp (beginner question)

From: Gordon Weakliem (gordon.weakliem at gmail.com)
Date: Fri Nov 12 14:58:30 EST 2004

Since I'm the one who wrote that sentence, I'll take the blame :-)
The best example of this in the Cookbook is
WebExtractingAllLinksFromPage [1], though XMLRecipeRSS [2] is a
pretty good one as well.  I also did a weblog entry that involved
using pattern matching to scrape a web page [3].

+1 to the recommendation of HtmlPrag, and to WebScraperHelper [4] as well.
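
For a rough idea of what working on the structure (rather than on a
flat string) looks like, here is a minimal sketch. It is not taken from
the recipes above. It assumes the page has already been parsed into an
SXML-style list, the kind of tree HtmlPrag is meant to produce, and the
names element-attributes, collect-links and sample-page are made up for
the illustration. It uses plain recursion rather than a match form, so
it should run in any Scheme.

```scheme
;; Sketch only: collect the href values of all <a> elements in an
;; SXML-style tree. The assumed element shape is
;;   (tag (@ (name value) ...) child ...)
;; where the (@ ...) attribute list is optional and text is a string.

;; Return the attribute list of an element, or '() if it has none.
(define (element-attributes node)
  (if (and (pair? (cdr node))
           (pair? (cadr node))
           (eq? (car (cadr node)) '@))
      (cdr (cadr node))
      '()))

;; Walk the tree depth-first, gathering hrefs in document order.
(define (collect-links node)
  (if (and (pair? node)
           (symbol? (car node))
           (not (eq? (car node) '@)))     ; skip attribute lists
      (let* ((attrs (element-attributes node))
             (href  (assq 'href attrs))
             (here  (if (and (eq? (car node) 'a)
                             href
                             (pair? (cdr href)))
                        (list (cadr href))
                        '()))
             (below (apply append (map collect-links (cdr node)))))
        (append here below))
      '()))                               ; strings and other atoms

;; Hypothetical, hand-written input standing in for a parsed page.
(define sample-page
  '(html
    (body
     (p "See "
        (a (@ (href "http://schemecookbook.org/")) "the Cookbook")
        " and "
        (a (@ (href "http://www.neilvandyke.org/webscraperhelper/")
              (class "ext"))
           "WebScraperHelper")
        "."))))

(collect-links sample-page)
;; => ("http://schemecookbook.org/"
;;     "http://www.neilvandyke.org/webscraperhelper/")
```

Because the links are found by tag and attribute name, the result does
not depend on attribute order, quoting style or whitespace in the
original HTML, which is where a regexp over the raw page tends to get
fragile.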

[1] http://schemecookbook.org/Cookbook/WebExtractingAllLinksFromPage
[2] http://schemecookbook.org/Cookbook/XMLRecipeRSS
[3] http://www.eighty-twenty.net/blog/urn:www-eighty-twenty-net:1253.html
[4] http://www.neilvandyke.org/webscraperhelper/




On Thu, 11 Nov 2004 22:32:27 +0100, Thomas-Xavier MARTIN
<txm+plt-scheme at m4x.org> wrote:
> I'm trying to port some TCL web-scraping code to PLT-Scheme as a way to gain
> more understanding of Scheme.
> 
> The TCL code fetches a page on the web, looks for some tidbits in the HTML
> code and then acts on the tidbits, a very standard behaviour for a
> web-scraping script.
> 
> I know how to fetch a page in Scheme, and I know how to act on the tidbits
> found. I was ready to use a regexp to parse the HTML page for the interesting
> tidbits (which is what the TCL code does), but I read in the Cookbook:
> 
> "Lisps in general are sort of famous for looking down on Regular Expressions.
> Other languages that lack Scheme's powerful pattern matching tend to fall
> back on regular expressions to provide some of that capability"
> http://schemecookbook.org/Cookbook/RegexChapter
> 
> But the pattern matching chapter of the Cookbook is just a stub without any
> recipe or example, and I haven't been able to understand what the Help Desk
> says on the two pattern matching libraries...
> 
> So, would anybody care to tell me whether I should use regexps or pattern
> matching, and possibly point me to a good explanation (with examples, please?)
> of pattern matching in Scheme?
> 
> Or am I doing this all wrong? Maybe I should read the HTML as an X-expression
> and use the underlying structure instead of parsing a flat string. (Some of
> the tidbits I parse for are the external links in the HTML page.)
> 
> Any pointers towards enlightenment would be greatly appreciated!
> --
> Sincerely,
> Thomas-Xavier MARTIN
> txm+plt-scheme at m4x.org


Posted on the users mailing list.