[racket] XML library: representing CDATA

From: Norman Gray (norman at astro.gla.ac.uk)
Date: Thu Jan 5 13:07:47 EST 2012

Jay, hello.

On 4 Jan 2012, at 20:53, Jay McCarthy wrote:

>> In the XML module's cdata struct, "[t]he string field is assumed to
>> be of the form <![CDATA[‹content›]]> with proper quoting of
>> ‹content›."  It's not clear that this is a very useful design of the
>> interface.
>> Principally, it makes it inconvenient to get at the <content>, and
>> requires calls to substring (or something like that) in order to
>> extract the <content> from cdata-string.
> I'm happy including a helper function that does the substring.
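
For concreteness, such a helper might look roughly like the sketch below. The name cdata-content is hypothetical (it is not part of the xml library), and the code assumes that a string lacking the wrapper should be returned unchanged.

```racket
#lang racket
(require xml)

;; cdata-content is a hypothetical name, not part of the xml library.
;; It returns the text between "<![CDATA[" and "]]>", or the whole
;; string unchanged if the wrapper is absent.
(define (cdata-content c)
  (define s (cdata-string c))
  (define prefix "<![CDATA[")
  (define suffix "]]>")
  (if (and (string-prefix? s prefix)
           (string-suffix? s suffix))
      (substring s
                 (string-length prefix)
                 (- (string-length s) (string-length suffix)))
      s))

(cdata-content (cdata #f #f "<![CDATA[b&r<>]]>")) ; => "b&r<>"
```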

An alternative would be to define, say, the following.

#lang racket

(struct source (start stop)) ; dummy definition

;; The guard strips the <![CDATA[...]]> wrapper if it is present, so
;; cdata-chars always holds the bare content.
(struct cdata source (chars)
  #:guard (λ (start stop chars type)
            (cond ((regexp-match #rx"^<!\\[CDATA\\[(.*)]]>$" chars)
                   => (λ (m)
                        (values start stop (list-ref m 1))))
                  (else (values start stop chars)))))

;; Compatibility accessor: rebuilds the wrapped form that the current
;; cdata-string returns.
(define (cdata-string cdata)
  (string-append "<![CDATA[" (cdata-chars cdata) "]]>"))

(define c1 (cdata #f #f "cdata1"))
(define c2 (cdata #f #f "<![CDATA[cdata2]]>"))

(printf "c1: ~a & ~a~%" (cdata-chars c1) (cdata-string c1))
(printf "c2: ~a & ~a~%" (cdata-chars c2) (cdata-string c2))

This prints:

c1: cdata1 & <![CDATA[cdata1]]>
c2: cdata2 & <![CDATA[cdata2]]>

This would entail corresponding changes to the XML writer, but would be coherent and backward compatible, in the sense that something that was illegal before would become legal, but nothing hitherto legal would become illegal.

>> Secondly, it represents low-level syntactical information which
>> should not, I think, be present in the result of a parse of an XML
>> document.  The fact that the content string originated from within a
>> CDATA section is, I think, useful to know, but only just.  Note that
>> the fact that a string or character originated within a CDATA
>> section is not part of the XML information set
>> (<http://www.w3.org/TR/xml-infoset/> Sect. 2.6, and Appx D point
>> 19).  Supposing (which would be sturdily defensible) that xexprs
>> should represent no more than the content of the XML information
>> set, then there would be no need for the cdata structure at all
>> (though this obviously makes escaping characters on output somewhat
>> more involved).
> I'm happy making the backwards compatible change of changing the
> reader to never produce them.

Right, so parsing "<p>Foo <![CDATA[b&r<>]]> baz</p>" would produce either

(list 'p '() "Foo " "b&r<>" " baz")

or

(list 'p '() "Foo b&r<> baz")

The only arguable downside to this is that the presence of a #<cdata> structure gives the caller a hint that there's something that (someone thought) needs escaping here.  However, if they're being as careful as they should be about escaping before outputting, then this won't make any difference.
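
A caller who wants the merged form with the reader as it stands could post-process the xexpr. The following is only a sketch: flatten-cdata, cdata->string, merge-strings and attribute-list? are hypothetical helpers, not part of the xml library, and the code assumes the only cdata structs of interest are those using the <![CDATA[...]]> wrapper.

```racket
#lang racket
(require xml)

;; All helper names below are hypothetical, not part of the xml library.

;; Drop the <![CDATA[...]]> wrapper from a cdata's string, if present.
(define (cdata->string c)
  (define m (regexp-match #rx"^<!\\[CDATA\\[(.*)]]>$" (cdata-string c)))
  (if m (cadr m) (cdata-string c)))

;; Join runs of adjacent strings in a list of xexpr children.
(define (merge-strings items)
  (foldr (λ (item acc)
           (if (and (string? item) (pair? acc) (string? (car acc)))
               (cons (string-append item (car acc)) (cdr acc))
               (cons item acc)))
         '()
         items))

;; Recognise an xexpr attribute list such as '() or '((href "x")).
(define (attribute-list? v)
  (and (list? v)
       (andmap (λ (a)
                 (and (list? a) (= (length a) 2)
                      (symbol? (car a)) (string? (cadr a))))
               v)))

;; Replace every cdata in an xexpr with a plain string, then merge
;; neighbouring strings.
(define (flatten-cdata x)
  (cond
    [(cdata? x) (cdata->string x)]
    [(and (pair? x) (symbol? (car x)))
     (define-values (attrs body)
       (if (and (pair? (cdr x)) (attribute-list? (cadr x)))
           (values (list (cadr x)) (cddr x))
           (values '() (cdr x))))
     `(,(car x) ,@attrs ,@(merge-strings (map flatten-cdata body)))]
    [else x]))

(flatten-cdata
 (list 'p '() "Foo " (cdata #f #f "<![CDATA[b&r<>]]>") " baz"))
;; => '(p () "Foo b&r<> baz")
```

Once the cdata is gone, write-xexpr escapes the "&", "<" and ">" in the ordinary way, so the output remains well-formed.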

>> It's also completely counterintuitive: the documentation of this
>> struct is only three sentences long, and when reading it I _still_
>> managed to elide the explanation that the CDATA line-noise actually
>> had to be included in the string, presumably because it seemed so
>> obvious that it wouldn't.
> The sentence is there because it is non-intuitive. I don't know any
> other way to say it. The XML collect doesn't insert the wrapper, it
> assumes it is already there.

Perhaps a big "NOTE:" at the beginning of the second paragraph would draw attention to it.

>> Side-issue regarding the wording of the documentation: it's not
>> completely clear what "proper quoting of content" means.  I presume
>> it means purely racket-quoting of the string contents, and doesn't
>> refer to XML quoting at all.  Thus (cdata #f #f "<![CDATA[\"&]]>")
>> would be acceptable in principle (it is acceptable in fact).
> It refers to the fact that "]]>" cannot appear in the content.

We may be at cross-purposes, then, but it's still not clear what "proper quoting" refers to, since there's no scope for quoting the contents of CDATA sections.  If you want to include "]]>" within/near a CDATA section (perhaps you're writing about CDATA sections, or you have a taste for esoteric smilies: 8]]> "gleeful person with handlebar moustache"), then you'd have to do something like <![CDATA[esoteric smilie: 8]]]]><![CDATA[> "gleeful"]]>

I think it would be reasonable for write-xexpr and friends to simply throw an error if they find a "]]>" in CDATA content, leaving it up to the creator of the xexpr to handle this corner case themself.
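
Both behaviours are easy to sketch. The function names below are hypothetical, and neither is what the xml library currently does: the first raises an error when the content contains "]]>", and the second splits the section in the way shown in the smilie example above.

```racket
#lang racket

;; Hypothetical helpers, not part of the xml library.

;; Strict variant: refuse content that cannot be placed in a single
;; CDATA section.
(define (wrap-cdata/strict content)
  (when (string-contains? content "]]>")
    (error 'wrap-cdata/strict
           "CDATA content contains \"]]>\": ~e" content))
  (string-append "<![CDATA[" content "]]>"))

;; Splitting variant: end the section after "]]" and reopen it
;; before ">", so that "]]>" never appears inside one section.
(define (wrap-cdata/split content)
  (string-append "<![CDATA["
                 (string-replace content "]]>" "]]]]><![CDATA[>")
                 "]]>"))

(wrap-cdata/split "esoteric smilie: 8]]> \"gleeful\"")
;; => "<![CDATA[esoteric smilie: 8]]]]><![CDATA[> \"gleeful\"]]>"
```

The strict variant keeps the writer simple and leaves the corner case to whoever built the xexpr; the splitting variant is shown only for comparison.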

>> Is there any chance of a (admittedly backward-incompatible) change
>> to this part of the interface?  I doubt that the cdata structure is
>> very extensively used.
> I believe its main use is in including Javascript output where XML
> quoting will cause stuff like "1 < 2" to fail to compile in most
> browsers. In that case, it is very important that the CDATA tags not
> be there (i.e. we WANT invalid XML) because browsers will break on
> that too.

That's the broad sort of situation where I'm using it.  Looking at Eli's Javascript example, I think that's a case where the module can properly leave such two-language-at-a-time hacking to the (poor) author, and blithely output <![CDATA[...]]> in all cases.

Best wishes,


Norman Gray  :  http://nxg.me.uk

Posted on the users mailing list.