[plt-scheme] xexprs and sxml
On Mon, 2006-01-09 at 14:21 -0500, Anton van Straaten wrote:
> One thing which HTMLPrag handles is HTML entity representation and
> generation, which Dave mentioned: a list of the form (& entity) in the
> SXML will be converted to the corresponding HTML entity. Since HTMLPrag
> is geared towards permissive parsing of HTML, I assume it handles
> parsing of HTML containing entities, too, if that's needed.
Well, not permissive enough... :)
I just made a simple change because Yahoo was using unescaped ampersands
(poor HTML), which was tripping up htmlprag. With this fix, htmlprag+sxml
was able to process over 8,000 companies from Yahoo's Finance Sector
and Industry section and extract what I needed without further changes.
The change was something like this: when parsing text, if you get an
ampersand and then hit whitespace before seeing a terminating semicolon,
assume it was a literal & and not the start of an entity.
I should send it off to Neil.
823c823,825
< (make-shtml-entity name)))))
---
> (if (c-semi?)
> (make-shtml-entity name)
> (string-append "&" name))))));; RPR
That is, for poor HTML with an unescaped & and no terminating semicolon,
add the & back in front of the name and treat it as plain text.
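
For anyone who wants to see the rule outside of htmlprag, here is a
minimal standalone sketch of the same idea. It is not htmlprag's
tokenizer: scan-text, name-char? and flush are hypothetical names made
up for illustration, and entities are only tagged as (& "name") rather
than resolved the way make-shtml-entity would. An ampersand followed by
a name and a semicolon becomes an entity; anything else is kept as
literal text.

```scheme
;; Illustrative sketch only; NOT htmlprag's actual code.
;; All procedure names here are hypothetical.

;; Characters allowed in an entity name (assumption: letters, digits, #).
(define (name-char? c)
  (or (char-alphabetic? c) (char-numeric? c) (char=? c #\#)))

;; Push the pending literal text str[start, end) onto acc, if non-empty.
(define (flush str start end acc)
  (if (< start end)
      (cons (substring str start end) acc)
      acc))

;; Split str into literal strings and entity references (& "name").
(define (scan-text str)
  (let ((len (string-length str)))
    ;; i: current index, start: where pending literal text begins,
    ;; acc: pieces collected so far, in reverse order.
    (let loop ((i 0) (start 0) (acc '()))
      (cond
        ((= i len)
         (reverse (flush str start i acc)))
        ((char=? (string-ref str i) #\&)
         ;; Read the candidate entity name that follows the ampersand.
         (let name-loop ((j (+ i 1)))
           (cond
             ((and (< j len) (name-char? (string-ref str j)))
              (name-loop (+ j 1)))
             ((and (< j len) (> j (+ i 1)) (char=? (string-ref str j) #\;))
              ;; Non-empty name terminated by a semicolon: a real entity.
              (loop (+ j 1) (+ j 1)
                    (cons (list '& (substring str (+ i 1) j))
                          (flush str start i acc))))
             (else
              ;; No terminating semicolon: leave "&" and the name in the
              ;; pending literal text and carry on scanning.
              (loop j start acc)))))
        (else (loop (+ i 1) start acc))))))
```

With this, (scan-text "AT&T &amp; Co") gives ("AT&T " (& "amp") " Co"),
so the stray ampersand in "AT&T" survives as text while "&amp;" is still
recognized as an entity.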