[plt-scheme] Parsing html : match or regexp (beginner question)
On Nov 11, 2004, at 4:32 PM, Thomas-Xavier MARTIN wrote:
> I'm trying to port some TCL web-scraping code to PLT-Scheme as a way
> to gain more understanding of Scheme.
>
> The TCL code fetches a page on the web, looks for some tidbits in the
> HTML code and then acts on the tidbits, a very standard behaviour for
> a web-scraping script.
....
> Or am I doing this all wrong ? Maybe I should read the HTML as an Xexp
> and use the underlying structure instead of parsing a flat string.
> (Some of the tidbits I parse for are the external links in the HTML
> page)
YES!
Others have already pointed in the right direction, but I just wanted
to bring out this point: HTML is a string representation of structured
data. Recover the structured data _first_, then operate on that. It's
much, much easier to reason correctly about structured data than it is
about sequences of characters.
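
To make the point concrete, here is a minimal sketch of the "recover the
structure first" approach applied to the external-links case from the
question. It is written in Python rather than PLT Scheme, purely as an
illustration, and uses only the standard library's html.parser, which
hands over tags and attributes as structured events instead of a full
tree. The names LinkCollector, external_links, SAMPLE and the host
"example.org" are all hypothetical; nothing here comes from the original
TCL script.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser has already lower-cased tag and attribute names
        # and decoded entities in attribute values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html_text, own_host):
    """Return hrefs that name a host other than own_host."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    result = []
    for href in collector.links:
        host = urlparse(href).netloc
        # Relative links have an empty host and count as internal.
        if host and host != own_host:
            result.append(href)
    return result


# Assumed sample input: mixed case, mixed quoting, a tag split over lines.
SAMPLE = """<html><body>
<a href="/local/page.html">local</a>
<A HREF='http://www.plt-scheme.org/'>PLT</A>
<a
   href="http://example.com/a?x=1&amp;y=2">split across lines</a>
</body></html>"""

links = external_links(SAMPLE, "example.org")
```

The upper-case tag, the single-quoted attribute, the tag broken across
two lines and the &amp; entity are exactly the cases that tend to trip
up a regexp over the flat string; once the parser has dealt with them,
the filtering step only ever sees clean attribute values.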
Note that I'm not insisting on X-expressions; you might well be happier
with other structured representations--say, sxml, which is now bundled
with the intermediate releases of Dr/MzScheme.
john clements
> Any pointers towards enlightenment would be greatly appreciated!
> --
> Sincerely,
> Thomas-Xavier MARTIN
> txm+plt-scheme at m4x.org
>