[plt-scheme] SCRIPT elements & evaporating CDATA sections [was RE: html in servlets]

From: Anton van Straaten (anton at appsolutions.com)
Date: Sun May 16 20:43:41 EDT 2004

I wrote:

> You can do something similar in plain code without XML boxes:
>
> `(script ([type "text/javascript"])
>    ,(make-comment "
>       if (a < b)
>         a = b;"))
>
> This isn't ideal.  I'll post a separate message about that.

This turns out to be a messy business.  To prevent characters in script code
from being escaped, make-comment is probably the most pragmatic solution
short of using external script files (unless I've missed some other way to
protect content from being escaped).

With read-xml, though, make-comment isn't an option, since the input is XML
text rather than an xexpr.  In that case, you can wrap the script in an HTML
<!-- comment --> instead, but the comment only survives if comment reading
has been enabled via (read-comments #t).
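
The read-xml case can be tried out with a short round trip.  This is only a
sketch: round-trip and commented-script are names made up for illustration,
and the require form assumes the PLT Scheme v20x collection layout (later
versions spell it (require xml)).  The first call uses the default settings,
the second enables comment reading.

```scheme
(require (lib "xml.ss" "xml"))

;; Hypothetical helper: parse STR as one element and write it straight
;; back to the current output port.
(define (round-trip str)
  (write-xml/content (read-xml/element (open-input-string str)))
  (newline))

;; A script element whose body is hidden inside an HTML comment.
(define commented-script
  (string-append
   "<script type='text/javascript'><!--\n"
   "  if (a < b) a = b;\n"
   "//--></script>"))

;; Default settings: comment reading is off (see note [3] in the table).
(round-trip commented-script)

;; With comment reading enabled for the duration of the call (note [4]).
(parameterize ([read-comments #t])
  (round-trip commented-script))
```

Using parameterize rather than a bare (read-comments #t) keeps the setting
from leaking into other code that reads XML.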

I have a suggestion for a way to address this more consistently and (in some
cases) conveniently, although I don't think it ought to be a high priority.
Since I went to the trouble of looking into it, though, I went ahead and
wrote it up.

Here's a table of some of the ways to create xexprs (rows) vs. the ways to
wrap script code (columns), with the cells specifying whether a combination
succeeds or fails with xml.ss:

                 CDATA     <!-- script -->  make-comment
                ---------  ---------------  ------------
XML box:         fail [1]     fail [3]         OK
read-xml:        fail [1]     OK   [4]         n/a
direct xexpr:    fail [2]     fail [3]         OK

[1] Entities escaped
[2] CDATA section not creatable; entities would be escaped anyway
[3] Comment stripped
[4] Requires (read-comments #t)

The XHTML DTD says that the SCRIPT element has PCDATA content, but browser
script engines can't handle HTML entity references in script source code.
If entities are escaped in script source code, the script will fail in a
browser.  This is described at:
http://www.w3.org/TR/xhtml1/#h-4.8

The suggested solution is to wrap the script in a CDATA section.  This
doesn't work with xml.ss, though, because the CDATA wrapper is eliminated at
read time, so when the XML is regenerated, characters are escaped.

This means that some strictly conforming XHTML will not survive being read
in and written out by xml.ss.  Here's a demonstration of this:

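  (require (lib "xml.ss" "xml"))

  ;; Read a script element whose body is wrapped in a CDATA section.
  (define x
    (read-xml/element
     (open-input-string
      (string-append
       "<script type='text/javascript'>"
       "<![CDATA[ if (a < b) b; ]]></script>"))))

  ;; The xexpr holds the raw text; the CDATA wrapper is already gone.
  (write (xml->xexpr x))
     ;=> (script ((type "text/javascript")) " if (a < b) b; ")      ; good!
  (newline)
  ;; Writing the element back out escapes the "<".
  (write-xml/content x)
     ;=> <script type="text/javascript"> if (a &lt; b) b; </script> ; bad!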

Something similar probably applies to XSL documents, which can also contain
embedded script.

The source code to "blame" here (using Robby's favorite term ;) is in
collects/xml/private/reader.ss:

  [(#\[) (read-char in)
   (unless (string=? (read-string 6 in) "CDATA[")
     (lex-error in pos "expected CDATA following <["))
   (let ([data (lex-cdata-contents in pos)])
     (make-pcdata start (pos) data))]

If the CDATA wrapper were retained until an xexpr is converted to XML, so
that the contained content would not be escaped at that time, it would
become possible to use CDATA sections to wrap script code both in XML boxes
and in XHTML read via read-xml.  I don't know whether this might affect
other uses of CDATA, but it shouldn't.

This doesn't address the situation with xexprs in source code, though,
because there doesn't seem to be a way to create CDATA sections in xexprs.
If the change related to CDATA wrappers were made, then it would make sense
to have a constructor for CDATA sections to allow something like this:

  `(script ([type "text/javascript"]) ,(make-cdata "if (a < b) b;"))

Just to clarify, the CDATA wrapper should still be dropped when this is
written as XML, e.g. for the above xexpr, write-xml/content should produce
this:

  <script type="text/javascript">if (a < b) b;</script>
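
To make the proposed behaviour concrete, here is a minimal, self-contained
sketch of a writer that escapes ordinary strings but emits CDATA-tagged
strings verbatim, without the wrapper.  It does not use xml.ss at all:
sketch-cdata, sketch-cdata?, escape-text and element->string are
hypothetical names invented for illustration, and every xexpr is assumed to
carry an explicit attribute list.

```scheme
;; Hypothetical stand-in for the proposed constructor: tag a string so
;; the writer knows not to escape it.
(define (sketch-cdata str) (vector 'cdata str))
(define (sketch-cdata? x)
  (and (vector? x)
       (= (vector-length x) 2)
       (eq? (vector-ref x 0) 'cdata)))

;; Escape the characters that are significant in XML text and attributes.
(define (escape-text str)
  (let loop ([chars (string->list str)] [acc '()])
    (if (null? chars)
        (apply string-append (reverse acc))
        (loop (cdr chars)
              (cons (case (car chars)
                      [(#\<) "&lt;"]
                      [(#\>) "&gt;"]
                      [(#\&) "&amp;"]
                      [(#\") "&quot;"]
                      [else (string (car chars))])
                    acc)))))

;; Plain strings are escaped; tagged strings are written as-is and the
;; tag itself is dropped; anything else is treated as a nested element.
(define (content->string item)
  (cond
    [(string? item) (escape-text item)]
    [(sketch-cdata? item) (vector-ref item 1)]
    [else (element->string item)]))

(define (attr->string attr)
  (string-append " " (symbol->string (car attr))
                 "=\"" (escape-text (cadr attr)) "\""))

;; Assumes the shape (tag ((name "value") ...) content ...).
(define (element->string xexpr)
  (let ([tag (symbol->string (car xexpr))])
    (string-append
     "<" tag
     (apply string-append (map attr->string (cadr xexpr)))
     ">"
     (apply string-append (map content->string (cddr xexpr)))
     "</" tag ">")))

(display
 (element->string
  `(script ([type "text/javascript"]) ,(sketch-cdata "if (a < b) b;"))))
;; prints: <script type="text/javascript">if (a < b) b;</script>
(newline)
(display
 (element->string
  `(script ([type "text/javascript"]) "if (a < b) b;")))
;; prints: <script type="text/javascript">if (a &lt; b) b;</script>
(newline)
```

The only decision the writer makes is in content->string, which is why
keeping the tag alive until write time is enough to get the behaviour
described above.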

There's still a kind of inconsistency here, since reading and writing XHTML
doesn't preserve CDATA sections; but it's no worse than the current
situation, which drops them earlier in the process.

With these changes, all three cells in the CDATA column in the earlier table
would read "OK", which would provide at least one consistent way to embed
script code in xexprs.

This still isn't a perfect solution, since e.g. read-xml can still fail on
ordinary HTML with script, as opposed to XHTML.  However, fixing it more
thoroughly would require more intelligence in xml.ss; e.g. it would need to
treat elements like SCRIPT specially in the HTML context.  The best way to
do that might be DTD-directed, which is of course not so trivial.
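
As a rough illustration of what treating SCRIPT specially could mean on the
reading side, the sketch below consumes everything up to a closing tag as
raw text instead of lexing it as markup.  It is not DTD-directed and is not
taken from xml.ss: read-raw-text and ends-with? are hypothetical names, the
closing tag is matched case-sensitively, and the repeated string-append is
only acceptable for short inputs.

```scheme
;; Does STR end with SUFFIX?
(define (ends-with? str suffix)
  (let ([ls (string-length str)]
        [lx (string-length suffix)])
    (and (>= ls lx)
         (string=? (substring str (- ls lx) ls) suffix))))

;; Read characters from port IN until END-TAG has been consumed (or EOF),
;; and return the text that preceded it, unparsed.
(define (read-raw-text in end-tag)
  (let loop ([text ""])
    (let ([c (read-char in)])
      (if (eof-object? c)
          text
          (let ([text (string-append text (string c))])
            (if (ends-with? text end-tag)
                (substring text 0 (- (string-length text)
                                     (string-length end-tag)))
                (loop text)))))))

(define in (open-input-string " if (a < b) a = b; </script><p>after</p>"))
(write (read-raw-text in "</script>"))
;; prints: " if (a < b) a = b; "
(newline)
```

After the call the port is positioned just past the closing tag, so an
ordinary reader could resume from there.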

Anton



Posted on the users mailing list.