[plt-scheme] Recommendations for parsing HTML

From: Patrick Lozzi (patricklozzi at gmail.com)
Date: Thu Dec 4 00:57:20 EST 2008

It appears I'm really at a loss without the PLaneT htmlprag library. I
tried setting the file's language to R5RS, but then I can't use the PLaneT
htmlprag library at all: it fails with "reference to undefined identifier:
require". Mustn't the file be set to module in order to use require?
That's the impression I'm getting. Whenever I received this error in the
past, it was because the current file wasn't set to module, and setting it
to module fixed the error. But if I set it back to module, I'm back at
square one with the mutable cons cells problem that affects htmlprag on v4
and later versions.

In summary, here's what I have tried so far, and I'm open to your suggestions:
  * I saved htmlprag.ss locally with its language set to module. If I
reload it (F5), I get: "expand: unbound identifier in module in: set-cdr!"
  * If this same file is set to r5rs, I get: "define-syntax: not allowed in an
expression context in: (define-syntax %htmlprag:testeez (syntax-rules () ((_
x ...) (error "Tests disabled."))))"
  * In the calling file, I tried to pull in the library from PLaneT with a
require statement. As far as I know, the calling file has to be set to
module for that, and I received the same set-cdr! error as above.
  * If I set the calling file to r5rs, I still cannot 'load' the local
file; I receive the define-syntax error quoted above.
  * I rolled back to a pre-v4 release, specifically 372, and got the same
define-syntax errors as above.

Can you recommend either a solution to this problem or a different way to
parse HTML that avoids this library altogether? I've tried a few other
libraries, such as SSAX from here:
  http://okmij.org/ftp/Scheme/xml.html#HTML-parser
but the provided source produces undefined-identifier errors: symbols are
used that are not defined anywhere in the file. I checked for possible
dependencies and didn't find any.

I'm trying to collect all the URLs in an HTML page and queue them for
analysis. The HTML may be ill-formed, so I'm looking for a library that
tolerates a few missing tags. That was my original reason for trying SSAX
(a permissive parser), and I had also hoped I could get htmlprag to
work...
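
To make the goal concrete, here is a rough sketch of the link-collecting
step only. It assumes a permissive parser has already produced an
SXML-style tree, which is the shape I understand htmlprag is meant to
return. The name collect-hrefs and the sample tree are made up for
illustration; this is not code from htmlprag or SSAX. It uses only plain
list operations, so it does not need set-cdr!.

```scheme
#lang scheme

;; collect-hrefs : sxml -> (listof string)
;; Walk an SXML-style tree and return every href attribute value,
;; in document order.  Assumes the tree is built from proper lists,
;; with attributes written as (@ (name "value") ...).
(define (collect-hrefs tree)
  (cond
    ;; Strings, symbols and the empty list contain no links.
    [(not (pair? tree)) '()]
    ;; An attribute node of the form (href "some-url").
    [(and (eq? (car tree) 'href)
          (pair? (cdr tree))
          (string? (cadr tree)))
     (list (cadr tree))]
    ;; Otherwise recurse into every child and concatenate the results.
    [else (apply append (map collect-hrefs tree))]))

;; Hypothetical sample tree, written by hand for illustration.
(define sample
  '(*TOP* (html (body (a (@ (href "http://example.org/a")) "A")
                      (p (a (@ (href "/b")) "B"))))))

(collect-hrefs sample)
```

On the sample tree this should return the two href strings in the order
they appear. The missing piece is still the parser that builds such a tree
from ill-formed HTML.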

Thank you for your time and I look forward to your ideas,
-Patrick

Posted on the users mailing list.