[racket] XML library: representing CDATA

From: Jay McCarthy (jay.mccarthy at gmail.com)
Date: Wed Jan 4 15:53:04 EST 2012


On Tue, 3 Jan 2012 21:41:28 +0000, Norman Gray <norman at astro.gla.ac.uk> mumbled:

> Greetings.

> In the XML module's cdata struct, "[t]he string field is assumed to
> be of the form <![CDATA[‹content›]]> with proper quoting of
> ‹content›."  It's not clear that this is a very useful design of the
> interface.

> Principally, it makes it inconvenient to get at the <content>, and
> requires calls to substring (or something like that) in order to
> extract the <content> from cdata-string.

I'm happy including a helper function that does the substring.
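
A minimal sketch of what such a helper might look like. The name
cdata-content is hypothetical and is not part of the xml library; the
sketch assumes the string field carries the full <![CDATA[...]]> wrapper
and returns the string unchanged when the wrapper is absent.

```racket
#lang racket/base
(require racket/string xml)

;; The literal delimiters of a CDATA section.
(define cdata-prefix "<![CDATA[")
(define cdata-suffix "]]>")

;; Hypothetical helper (not provided by the xml library): return the text
;; between the CDATA delimiters, or the whole string if the wrapper is
;; missing.
(define (cdata-content c)
  (define s (cdata-string c))
  (if (and (>= (string-length s)
               (+ (string-length cdata-prefix)
                  (string-length cdata-suffix)))
           (string-prefix? s cdata-prefix)
           (string-suffix? s cdata-suffix))
      (substring s
                 (string-length cdata-prefix)
                 (- (string-length s) (string-length cdata-suffix)))
      s))

;; Using the example from the original message:
(cdata-content (cdata #f #f "<![CDATA[\"&]]>"))
```

Falling back to the unchanged string is a design choice made for this
sketch; raising an error on a missing wrapper would be equally defensible.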

> Secondly, it represents low-level syntactical information which
> should not, I think, be present in the result of a parse of an XML
> document.  The fact that the content string originated from within a
> CDATA section is, I think, useful to know, but only just.  Note that
> the fact that a string or character originated within a CDATA
> section is not part of the XML information set
> (<http://www.w3.org/TR/xml-infoset/> Sect. 2.6, and Appx D point
> 19).  Supposing (which would be sturdily defensible) that xexprs
> should represent no more than the content of the XML information
> set, then there would be no need for the cdata structure at all
> (though this obviously makes escaping characters on output somewhat
> more involved).

I'm happy to make the backwards-compatible change of having the
reader never produce cdata structs.

> It's also completely counterintuitive: the documentation of this
> struct is only three sentences long, and when reading it I _still_
> managed to elide the explanation that the CDATA line-noise actually
> had to be included in the string, presumably because it seemed so
> obvious that it wouldn't.

The sentence is there because it is non-intuitive. I don't know any
other way to say it. The XML collection doesn't insert the wrapper; it
assumes the wrapper is already there.

> Side-issue regarding the wording of the documentation: it's not
> completely clear what "proper quoting of content" means.  I presume
> it means purely racket-quoting of the string contents, and doesn't
> refer to XML quoting at all.  Thus (cdata #f #f "<![CDATA[\"&]]>")
> would be acceptable in principle (it is acceptable in fact).

It refers to the fact that "]]>" cannot appear in the content.
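
To make that constraint concrete, here is a hedged sketch of a constructor
that wraps raw content and keeps "]]>" out of any single section. The name
content->cdata is hypothetical; the xml library itself does no such
rewriting. It uses the common XML workaround of ending the section after
"]]" and opening a new one before ">", so the resulting string holds two
adjacent CDATA sections.

```racket
#lang racket/base
(require racket/string xml)

;; Hypothetical constructor (not provided by the xml library): wrap raw
;; content in a CDATA section, splitting any "]]>" across two sections so
;; that the terminator never appears inside the content.
(define (content->cdata content)
  (cdata #f #f
         (string-append "<![CDATA["
                        (string-replace content "]]>" "]]]]><![CDATA[>")
                        "]]>")))

(cdata-string (content->cdata "1 < 2"))
(cdata-string (content->cdata "a]]>b"))
```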

> Is there any chance of a (admittedly backward-incompatible) change
> to this part of the interface?  I doubt that the cdata structure is
> very extensively used.

I believe its main use is in including JavaScript output where XML
quoting will cause stuff like "1 < 2" to fail to compile in most
browsers. In that case, it is very important that the CDATA tags not
be there (i.e. we WANT invalid XML) because browsers will break on
that too.
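
A small sketch of that use, assuming the writer emits the cdata string
exactly as given. The JavaScript snippet and the function go are made up
for illustration.

```racket
#lang racket/base
(require xml)

;; Illustrative JavaScript; go is a made-up function name.
(define js "if (1 < 2) { go(); }")

;; As an ordinary string, the "<" is XML-escaped on output.
(define escaped (xexpr->string `(script ,js)))

;; Inside a cdata struct with no wrapper, the text is written verbatim.
(define verbatim (xexpr->string `(script ,(cdata #f #f js))))

(displayln escaped)
(displayln verbatim)
```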

Jay

> Best wishes,

> Norman


> -- 
> Norman Gray  :  http://nxg.me.uk
> SUPA School of Physics and Astronomy, University of Glasgow, UK



--
Jay McCarthy <jay.mccarthy at gmail.com>
Assistant Professor / Brigham Young University
http://faculty.cs.byu.edu/~jay

"The glory of God is Intelligence" - D&C 93


Posted on the users mailing list.