[racket] lex-error report while reading HTML using module XML
On Thu, Dec 13, 2012 at 9:17 AM, Stephen Bloch <sbloch at adelphi.edu> wrote:
>
> On Dec 12, 2012, at 11:42 PM, Haiwei Zhou wrote:
>
>> In HTML the <img> tag has no end tag.
>
> Not exactly true. In XHTML (i.e. HTML >= 4.0, IIRC), it SHOULD have an end tag -- as should every other tag in XHTML. You can do this either with
> <img src="blah.blah"></img>
> or, briefer, with
> <img src="blah.blah"/>
>
> However, most or all browsers accept Web pages with certain common tags unterminated: <img>, <p>, <br>, <li>, etc. and there's a reasonable argument that Racket's XML library should be capable of accepting them too.

This is not really an accurate characterization of HTML, XHTML, and tags.

First, the standard that modern browsers follow is called HTML5 or
just HTML [1,2] and is not an XML dialect. In HTML, some tags do
*not* have a close tag (such as <img>). Further, the parser for HTML
is required to handle ill-formed HTML (such as unclosed <p> tags) in a
specified way. However, these two situations are distinct, and the
latter is an error recovery mechanism for invalid HTML. You can see
the distinction between them in a validator [3].

XHTML is a syntax for writing HTML which has somewhat different rules
for things like close tags, and is only used by browsers when content
is served with an XML MIME type (such as application/xhtml+xml).
There's a discussion of XHTML and HTML here [4].

Because of these issues, building an HTML parser that works with
in-the-wild HTML documents is a complicated endeavor that isn't really
helped by having an XML parser. That's why Jay suggested using a
dedicated HTML parser for such documents (there are separate issues
with *generating* HTML using the `xml` library, but those aren't
really relevant here).
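
As a rough illustration of the mismatch, here is a minimal sketch in
Python, chosen only because its standard library ships both an XML
parser and an HTML tokenizer. Nothing below is Racket's `xml` API, and
the names SNIPPET, parses_as_xml, TagCollector, and html_tags are made
up for this example. Note also that `html.parser` is only a lenient
tokenizer; it does not implement the error recovery that the HTML spec
requires of browsers.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: valid HTML, but not well-formed XML,
# because <img> has no end tag.
SNIPPET = '<p>A picture: <img src="blah.blah"></p>'


def parses_as_xml(text):
    """Return True if an XML parser accepts the text as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record start and end tags as an HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def html_tags(text):
    """Tokenize text as HTML and return (start_tags, end_tags)."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags, collector.end_tags


if __name__ == "__main__":
    print("well-formed XML?", parses_as_xml(SNIPPET))
    print("HTML tags seen:", html_tags(SNIPPET))
```

The XML parser rejects the snippet outright, while the HTML tokenizer
reports the <img> start tag with no matching end tag and carries on.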

Sam

[1] http://dev.w3.org/html5/spec/Overview.html
[2] http://www.whatwg.org/C
[3] http://validator.nu/
[4] http://www.whatwg.org/specs/web-apps/current-work/multipage/introduction.html#html-vs-xhtml