[plt-scheme] Parsing html : match or regexp (beginner question)

From: Thomas-Xavier MARTIN (txm+plt-scheme at m4x.org)
Date: Thu Nov 11 16:32:27 EST 2004

I'm trying to port some TCL web-scraping code to PLT Scheme as a way to 
understand Scheme better.

The TCL code fetches a page on the web, looks for some tidbits in the HTML 
code and then acts on the tidbits, a very standard behaviour for a 
web-scraping script.

I know how to fetch a page in Scheme, and I know how to act on the tidbits 
once they are found. I was ready to use a regexp to parse the HTML page for the 
interesting tidbits (which is what the TCL code does), but then I read in the Cookbook:

"Lisps in general are sort of famous for looking down on Regular Expressions. 
Other languages that lack Scheme's powerful pattern matching tend to fall 
back on regular expressions to provide some of that capability"
http://schemecookbook.org/Cookbook/RegexChapter

But the pattern matching chapter of the Cookbook is just a stub without any 
recipe or example, and I haven't been able to understand what the Help Desk 
says on the two pattern matching libraries...

So, would anybody care to tell me whether I should use regexps or pattern matching, 
and, if possible, point me to a good explanation (with examples, please) of 
pattern matching in Scheme?

Or am I doing this all wrong? Maybe I should read the HTML as an X-expression and 
use the underlying structure instead of parsing a flat string. (Some of the 
tidbits I parse for are the external links in the HTML page.)
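
To make the two options concrete, here is a minimal sketch of the difference 
between scanning a flat string and walking the parsed structure. It is written 
in Python purely for illustration (it is neither the TCL script nor Scheme), 
and the sample page and every name in it (SAMPLE_PAGE, links_by_regexp, 
LinkCollector, links_by_structure, external_only) are assumptions made up for 
the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical sample page, invented for this sketch.
SAMPLE_PAGE = """<html><body>
<a href="http://www.plt-scheme.org/">PLT</a>
<a href='/local/page.html'>local</a>
<A HREF="http://schemecookbook.org/Cookbook/RegexChapter">Cookbook</A>
</body></html>"""


def links_by_regexp(html):
    """Flat-string approach: scan for double-quoted href attributes only."""
    return re.findall(r'href="([^"]*)"', html, flags=re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Structure-based approach: let a parser find the tags and attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_by_structure(html):
    """Feed the page to the parser and return every href found on an <a> tag."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def external_only(links):
    """Keep only links that name a host, i.e. that point outside the page's site."""
    return [link for link in links if urlparse(link).netloc]


if __name__ == "__main__":
    print(links_by_regexp(SAMPLE_PAGE))
    print(external_only(links_by_structure(SAMPLE_PAGE)))
```

On this sample the regexp silently skips the single-quoted link, while the 
parser-based version sees all three links and leaves the filtering of 
external ones to a separate step. That is the kind of trade-off I am asking 
about.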

Any pointers towards enlightenment would be greatly appreciated!
-- 
Sincerely,
Thomas-Xavier MARTIN
txm+plt-scheme at m4x.org



Posted on the users mailing list.