[plt-scheme] Recommendations for parsing HTML

From: Robby Findler (robby at cs.uchicago.edu)
Date: Thu Dec 4 01:46:42 EST 2008

Did you try read-html-as-xml?
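
For the URL-collecting task described in the quoted message, here is a
minimal sketch of how read-html-as-xml might be used. It assumes the html
and xml collections that ship with PLT Scheme v4 and a file in the Module
language. The helper names attribute-list?, hrefs-in, and extract-urls are
made up for illustration, and the sample HTML string is invented. It avoids
set-cdr! entirely, so the immutable-pair issue should not come up.

```scheme
#lang scheme
(require html xml)

;; attribute-list? : any -> boolean
;; In an X-expression, an attribute list is either empty or a list of
;; (name value) pairs; a body element starts with a symbol instead.
(define (attribute-list? v)
  (or (null? v)
      (and (pair? v) (pair? (car v)))))

;; hrefs-in : xexpr -> (listof string)
;; Walks an X-expression and returns every href attribute value found,
;; in document order. Strings, comments, and entities yield no URLs.
(define (hrefs-in x)
  (if (and (pair? x) (symbol? (car x)))
      (let* ([has-attrs? (and (pair? (cdr x)) (attribute-list? (cadr x)))]
             [attrs (if has-attrs? (cadr x) '())]
             [body  (if has-attrs? (cddr x) (cdr x))]
             [here  (let ([a (assq 'href attrs)])
                      (if a (list (cadr a)) '()))])
        (append here (apply append (map hrefs-in body))))
      '()))

;; extract-urls : input-port -> (listof string)
;; read-html-as-xml returns a list of xml content structures;
;; xml->xexpr turns each one into an X-expression we can walk.
(define (extract-urls in)
  (apply append
         (map (lambda (c) (hrefs-in (xml->xexpr c)))
              (read-html-as-xml in))))

;; Invented sample input with an unclosed <p> tag.
(extract-urls
 (open-input-string
  "<html><body><a href=\"http://example.com/\">x</a><p>unclosed</body></html>"))
```

To run it on a real page, pass extract-urls a port from open-input-file, or
from get-pure-port in net/url. The sketch returns href values as written in
the page; resolving relative URLs is left out.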


On Wed, Dec 3, 2008 at 11:57 PM, Patrick Lozzi <patricklozzi at gmail.com> wrote:
> It appears I'm really at a loss without PLaneT's htmlprag library. I
> tried setting the file's language to R5RS, but then I couldn't use PLaneT's
> htmlprag library: it failed with "reference to undefined identifier:
> require". Mustn't the file be set to Module in order to use require?
> That's the impression I'm getting. Whenever I received this
> error in the past, the current file wasn't set to Module, and
> simply setting it to Module corrected the error. But if I set it back to
> Module, I'm back at square one with the mutable cons cells problem that
> plagues v4 and later versions when combined with htmlprag.
> In summary, here's what I have been trying and I'm open to your suggestions:
>   * I put the htmlprag.ss file locally, with its language set to Module. If I
> reload it (F5), I get: "expand: unbound identifier in module in: set-cdr!"
>   * If this same file is set to R5RS, I get: "define-syntax: not allowed in an
> expression context in: (define-syntax %htmlprag:testeez (syntax-rules () ((_
> x ...) (error "Tests disabled."))))"
>   * In the calling file, I tried to load the library from PLaneT with
> a require statement, but the calling file had to be set to Module, as far
> as I know, and I received the same set-cdr! error as above.
>   * If I set the calling file to R5RS, I cannot 'load' the local
> file either; I receive the define-syntax error stated above.
>   * I rolled back to a version before v4, specifically 372, and got the same
> define-syntax errors as above.
> Can you recommend either a solution to this problem or a different way
> to parse HTML than this library altogether?  I've tried a few other
> libraries, like SSAX from here:
>   http://okmij.org/ftp/Scheme/xml.html#HTML-parser
> but the provided source produces undefined-identifier errors: symbols
> are used that are not defined anywhere in the file. I checked for
> possible dependencies and didn't find any.
> I'm trying to collect all available URLs in the HTML and queue them for
> analysis.  The HTML should be allowed to be ill-formed; in other
> words, I would like a library that tolerates a few missing
> HTML tags. That was my original reason for trying SSAX (a permissive
> parser), and I had also hoped I could get
> htmlprag to work...
> Thank you for your time and I look forward to your ideas,
> -Patrick

Posted on the users mailing list.