[racket] se-path* returning multiple strings when tag contains XML entities

From: Giacomo Ritucci (giacomo.ritucci at gmail.com)
Date: Sun Dec 15 12:52:29 EST 2013

Thanks Jay, string-append* is really handy here.

Another hint came from Matthew Butterick, who pointed me to a message from
Matthias Felleisen suggesting the use of match:
http://lists.racket-lang.org/users/archive/2013-June/058426.html
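
For the one-line document from earlier in this thread, a match-based query
might look roughly like the sketch below. The name band-names is made up for
illustration, and the pattern assumes the empty attribute lists that
string->xexpr produces by default, so treat it as a starting point rather
than a general solution.

```racket
#lang racket
(require xml)

;; Sample data from this thread; each name contains an &amp; entity.
(define xe
  (string->xexpr
   (string-append "<bands>"
                  "<name>Derek &amp; the Dominos</name>"
                  "<name>Nick Cave &amp; the Bad Seeds</name>"
                  "</bands>")))

;; band-names : xexpr -> (listof string)   (hypothetical helper)
;; Matches a bands element whose children are all name elements with
;; string-only content, then joins the string pieces of each name.
(define (band-names x)
  (match x
    [(list 'bands (list) (list 'name (list) (? string? parts) ...) ...)
     ;; parts is a list of lists of strings, one inner list per name
     (map string-append* parts)]))

(band-names xe)
```

The nested ellipses collect the pieces of every name separately, so the
entity no longer splits a name into three results.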

I experimented a bit with the following example that combines a simple but
not trivial XML structure, whitespace and entities:

https://gist.github.com/rjack/7968318

(Any feedback is highly appreciated. For example, Jay mentioning
string-append* allowed me to get rid of all (apply string-append ...))

Honestly, my first thought was "That's an overly difficult approach to
a simple query on XML data".

Thoughts:

1. eliminate-whitespace was key to using match successfully; I wish I had
found it earlier
2. match patterns and list operations are really difficult to read (and
write) compared to the equivalent XPath expression
3. it would be great if the XML library could provide helper functions
(something like xe->string and xe-string=?)
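
To make points 1 and 3 concrete, here is a minimal sketch of what I have in
mind. xe->string and xe-string=? are hypothetical helpers, not part of the
xml library, and they assume every element carries an attribute list (the
default for xml->xexpr). eliminate-whitespace, read-xml, document-element
and se-path*/list are the library's own functions.

```racket
#lang racket
(require xml xml/path)

;; Indented input: <bands> contains whitespace-only text between children.
(define xml-text
  (string-append "<bands>\n"
                 "    <name>Derek &amp; the Dominos</name>\n"
                 "    <name>Nick Cave &amp; the Bad Seeds</name>\n"
                 "</bands>"))

;; Drop whitespace-only PCDATA inside <bands>, then convert to an xexpr.
(define xe
  (xml->xexpr
   ((eliminate-whitespace '(bands))
    (document-element (read-xml (open-input-string xml-text))))))

;; xe->string : xexpr -> string   (hypothetical helper)
;; Joins the string children of an element shaped like (tag (attrs) child ...).
;; Non-string children are skipped.
(define (xe->string x)
  (string-append* (filter string? (rest (rest x)))))

;; xe-string=? : xexpr string -> boolean   (hypothetical helper)
(define (xe-string=? x s)
  (string=? (xe->string x) s))

;; The children of bands are now only the name elements.
(define names (map xe->string (se-path*/list '(bands) xe)))
names
(xe-string=? (third xe) "Derek & the Dominos")
```

Without the eliminate-whitespace step, the children of bands would also
include the indentation strings, and xe->string would fail on them.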

Is there any interest in polishing this example so it can be turned into a
tutorial or a guide for the Racket XML library documentation? From a newbie's
point of view, this way of querying XML is not obvious.

Feedback, fixes and suggestions are highly appreciated.

Thanks again,
Giacomo



On Tue, Dec 10, 2013 at 12:45 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:

> Hi Giacomo,
>
> I think I would do this:
>
> (define (xe->string n)
>     (string-append* (rest (rest n))))
>
>   (check-equal? (map xe->string (se-path*/list '(bands) xe))
>                 '("Derek & the Dominos" "Nick Cave & the Bad Seeds"))
>
> Because you want the children of "bands" and you want to turn each one
> into a string.
>
>
> On Sat, Dec 7, 2013 at 6:30 PM, Giacomo Ritucci
> <giacomo.ritucci at gmail.com> wrote:
> > Hi Jay,
> >
> > thanks for your reply.
> >
> > Unfortunately I can't find a way in my code to detect that in the
> > resulting list from se-path*/list
> >
> >
> >     '("Derek " "&" " the Dominos" "Nick Cave " "&" " the Bad Seeds")
> >
> > the first three elements should actually be treated as a single string,
> > and so should the last three.
> >
> > Is there a common idiom in Racket to extract a list of values from an XML
> > collection, in a way that works with &amp; and other entities?
> >
> > Thanks in advance.
> >
> >
> > On Mon, Dec 2, 2013 at 9:27 PM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> >>
> >> Hi Giacomo,
> >>
> >> First, the question is not really about se-path*/list, because if you look
> >> at the xexpr you're giving it, the "name" node has three string
> >> children:
> >>
> >> '(bands () (name () "Derek " "&" " the Dominos") (name () "Nick Cave "
> >> "&" " the Bad Seeds"))
> >>
> >> And se-path*/list gives you these children all appended together. If you
> >> got the name nodes themselves, then you could concatenate their
> >> children.
> >>
> >> Second, the real question is about why parsing XML works like that.
> >> If you look at this:
> >>
> >> (define xs
> >>   "<bands><name>Derek &amp; the Dominos</name><name>Nick Cave &amp; the Bad Seeds</name></bands>")
> >> (define x
> >>   (read-xml/document (open-input-string xs)))
> >> x
> >>
> >> Then you'll see that the core is that name doesn't have a single piece
> >> of PCDATA. It has three, one of which is an entity.
> >>
> >> I don't consider this an error in the XML parser, but a consequence of
> >> XML entities that might not be obvious: they are their own nodes in
> >> the list of children of the parent node.
> >>
> >> Jay
> >>
> >>
> >> On Sun, Dec 1, 2013 at 8:36 AM, Giacomo Ritucci
> >> <giacomo.ritucci at gmail.com> wrote:
> >> > Hi Racket Users,
> >> >
> >> > I'm using se-path*/list to extract values from an XML collection but I
> >> > found a strange behaviour when the extracted values contain entities.
> >> >
> >> > For example, given the following XML:
> >> >
> >> > <bands>
> >> >     <name>Derek &amp; the Dominos</name>
> >> >     <name>Nick Cave &amp; the Bad Seeds</name>
> >> > </bands>
> >> >
> >> > when I extract a list of band names with (se-path*/list '(name) xe) I'd
> >> > expect this result:
> >> >
> >> >     '("Derek & the Dominos" "Nick Cave & the Bad Seeds")
> >> >
> >> > but what I actually receive is:
> >> >
> >> >     '("Derek " "&" " the Dominos" "Nick Cave " "&" " the Bad Seeds")
> >> >
> >> > Is this the intended behaviour? How can I overcome this and make
> >> > se-path*/list return one string per tag?
> >> >
> >> > Here's my test code. I'm running Racket v5.3.6 on Linux x86_64, and maybe
> >> > I'm overlooking something because I'm new to Racket.
> >> >
> >> > Thank you in advance!
> >> >
> >> > Best regards,
> >> > Giacomo
> >> >
> >> > #lang racket
> >> >
> >> > (require xml
> >> >          xml/path)
> >> >
> >> > (define xe (string->xexpr "<bands><name>Derek &amp; the Dominos</name><name>Nick Cave &amp; the Bad Seeds</name></bands>"))
> >> >
> >> > (module+ test
> >> >   (require rackunit)
> >> >
> >> >   ;; what I get
> >> >   (check-equal? (se-path*/list '(name) xe)
> >> >                 '("Derek " "&" " the Dominos" "Nick Cave " "&" " the Bad Seeds"))
> >> >
> >> >   ;; what I'd expect
> >> >   (check-equal? (se-path*/list '(name) xe)
> >> >                 '("Derek & the Dominos" "Nick Cave & the Bad Seeds")))
> >> >
> >> > ____________________
> >> >   Racket Users list:
> >> >   http://lists.racket-lang.org/users
> >> >
> >>
> >
> >
>
>
>
> --
> Jay McCarthy <jay at cs.byu.edu>
> Assistant Professor / Brigham Young University
> http://faculty.cs.byu.edu/~jay
>
> "The glory of God is Intelligence" - D&C 93
>

Posted on the users mailing list.