[plt-scheme] xexprs and sxml

From: Ray Racine (rracine at adelphia.net)
Date: Tue Jan 10 00:04:30 EST 2006

On Mon, 2006-01-09 at 14:21 -0500, Anton van Straaten wrote:
> One thing which HTMLPrag handles is HTML entity representation and 
> generation, which Dave mentioned: a list of the form (& entity) in the 
> SXML will be converted to the corresponding HTML entity.  Since HTMLPrag 
> is geared towards permissive parsing of HTML, I assume it handles 
> parsing of HTML containing entities, too, if that's needed.

Well, not permissive enough... :)

I just made a simple change because Yahoo was using unescaped ampersands
(poor HTML), which were tripping up htmlprag.  With this fix, htmlprag+sxml
was able to process over 8,000 companies from Yahoo's Finance Sector
and Industry section and extract what I needed without further changes.

The change was something like this: when parsing text, if you get an
ampersand and then hit whitespace before seeing a terminating semicolon,
assume it was a literal & and not an entity.

I should send it off to Neil.

823c823,825
<                                (make-shtml-entity name)))))
---
>                                (if (c-semi?)
>                                    (make-shtml-entity name)
>                                    (string-append "&" name))))));; RPR Poor html with unescaped & without a terminating semi-colon. Add it back.
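
For anyone who wants to see the rule outside of htmlprag, here is a rough
standalone sketch.  It is not htmlprag's code: the names entity-name-end and
split-entities are made up for illustration, and it simplifies by treating
only letters and digits as entity-name characters (so numeric references
like &#38; are left as plain text) and by falling back to a literal "&"
whenever the name is not followed by a semicolon, not only on whitespace.

```scheme
;; Hypothetical helpers for illustration only; not part of htmlprag.

;; Index of the first character at or after i that cannot be part of an
;; entity name (here: anything other than a letter or digit).
(define (entity-name-end str i)
  (let loop ((j i))
    (if (and (< j (string-length str))
             (let ((c (string-ref str j)))
               (or (char-alphabetic? c) (char-numeric? c))))
        (loop (+ j 1))
        j)))

;; Split str into a list of text strings and (& name) entity forms.
(define (split-entities str)
  (let ((len (string-length str)))
    (let loop ((i 0) (start 0) (acc '()))
      (cond
        ;; End of input: flush any pending text run.
        ((>= i len)
         (reverse (if (< start len)
                      (cons (substring str start len) acc)
                      acc)))
        ((char=? (string-ref str i) #\&)
         (let ((end (entity-name-end str (+ i 1))))
           (if (and (< end len)
                    (> end (+ i 1))
                    (char=? (string-ref str end) #\;))
               ;; Non-empty name terminated by a semicolon: a real entity.
               (loop (+ end 1)
                     (+ end 1)
                     (cons (list '& (string->symbol
                                     (substring str (+ i 1) end)))
                           (if (< start i)
                               (cons (substring str start i) acc)
                               acc)))
               ;; No terminating semicolon: leave "&" and the name in the
               ;; current text run, i.e. "add it back".
               (loop end start acc))))
        (else (loop (+ i 1) start acc))))))
```

With this, (split-entities "AT&T &amp; Co") gives ("AT&T " (& amp) " Co"):
the bare ampersand stays in the text while the properly terminated one
becomes an entity form.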

Posted on the users mailing list.