[plt-scheme] Parsing html : match or regexp (beginner question)

Fri Nov 12 16:07:11 EST 2004

On Nov 11, 2004, at 4:32 PM, Thomas-Xavier MARTIN wrote:

> I'm trying to port some TCL web-scraping code to PLT-Scheme as a way to
> gain more understanding of Scheme.
>
> The TCL code fetches a page on the web, looks for some tidbits in the
> HTML code and then acts on the tidbits, a very standard behaviour for a
> web-scraping script.

....

> Or am I doing this all wrong? Maybe I should read the HTML as an Xexp
> and use the underlying structure instead of parsing a flat string. (Some
> of the tidbits I parse for are the external links in the HTML page)

YES!

Others have already pointed in the right direction, but I just wanted
to bring out this point: HTML is a string representation of structured
data. Recover the structured data _first_, then operate on that. It's
much, much easier to reason correctly about structured data than it is
about sequences of characters.
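
To make the "structure first" idea concrete, here is a minimal sketch,
written in Python with only its standard library rather than in Scheme,
purely as an illustration. It first recovers a nested tree shaped roughly
like an X-expression ([tag, attributes, child, ...]) and only then walks
that tree to collect external links. The names TreeBuilder, parse_html and
external_links are hypothetical, and "external" is assumed here to mean
"the href carries a host name"; adjust that test to taste.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Void elements never get a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Build nested lists shaped like [tag, attrs, child, ...]."""

    def __init__(self):
        super().__init__()
        self.root = ["document", {}]   # synthetic root (an assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs)]
        self.stack[-1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():               # skip whitespace-only text
            self.stack[-1].append(data)


def parse_html(text):
    """Step 1: recover the structured data from the flat string."""
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


def external_links(node):
    """Step 2: operate on the structure, not on characters."""
    if isinstance(node, str):
        return []
    tag, attrs, *children = node
    found = []
    href = attrs.get("href")
    if tag == "a" and href and urlparse(href).netloc:
        found.append(href)
    for child in children:
        found.extend(external_links(child))
    return found


if __name__ == "__main__":
    page = ('<p>See <a href="http://example.org/a">A</a><br>'
            'and <a href="/local">B</a></p>')
    print(external_links(parse_html(page)))
```

Note that the link-finding function never looks at angle brackets or
quoting rules; all of that is dealt with once, in the parsing step.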

Note that I'm not insisting on X-expressions; you might well be happier 
with other structured representations--say, sxml, which is now bundled 
with the intermediate releases of Dr/MzScheme.

john clements

> Any pointers towards enlightenment would be greatly appreciated!
> -- 
> Sincerely,
> Thomas-Xavier MARTIN
> txm+plt-scheme at m4x.org
>