[plt-scheme] Re: progress with parser-tools

From: YC (yinso.chen at gmail.com)
Date: Fri Jul 25 14:52:12 EDT 2008

On Fri, Jul 25, 2008 at 2:45 AM, Simon Michael <simon at joyful.com> wrote:

>> I've hit a new problem: parsing a HTML block, which has matching open and
>> close tags. In regexps it looks like "<([^>]+)>.*?</\\2>". I haven't found
>> any way to mimic the \2 reference with parser-tools. What to do ?
>>
>
> After more reading, I think parser-tools and similar bottom-up parsers
> can't do this job - it requires a top-down parser, such as
> http://www.lshift.net/blog/2005/08/22/json-for-mzscheme-and-a-portable-packrat-parsing-combinator-library or
> http://www.lshift.net/blog/2008/07/01/ometa-for-scheme .
>

Hi -

Do you mean that a regular expression alone cannot parse
"<([^>]+)>.*?</\\2>"?  If so, you are correct, as a regular grammar is not
recursive and so cannot describe arbitrarily nested tags.  HTML is probably
more amenable to top-down parsing, but bottom-up should also be doable -
I've not tried it, but here's someone who has -
http://1997.webhistory.org/www.lists/www-html.1995q1/0019.html, and you
might be able to extract his grammars and designs.
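
To make the recursion point concrete, here is a minimal top-down
(recursive-descent) sketch, written in Python rather than Scheme. It is not
parser-tools code: the names TOKEN, tokenize, parse_nodes and parse are made
up for illustration, and it assumes tags carry no attributes. The open tag's
name is passed down as an argument and compared with the close tag, which is
the job the backreference does in the regexp.

```python
import re

# Hypothetical token pattern: an open tag, a close tag, or a run of text.
# Assumption: tags have no attributes and text contains no '<'.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def tokenize(src):
    """Split src into ('open', name), ('close', name) and ('text', s) tokens."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.group(3) is not None:
            tokens.append(("text", m.group(3)))
        elif m.group(1):
            tokens.append(("close", m.group(2)))
        else:
            tokens.append(("open", m.group(2)))
        pos = m.end()
    return tokens


def parse_nodes(tokens, i, closing=None):
    """Parse sibling nodes from tokens[i] up to the close tag named `closing`.

    With closing=None, parse to the end of input instead.
    Returns (nodes, index of the next unread token).
    """
    nodes = []
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "text":
            nodes.append(value)
            i += 1
        elif kind == "open":
            # Recurse: the children must end with a close tag of the same name.
            children, i = parse_nodes(tokens, i + 1, closing=value)
            nodes.append((value, children))
        else:
            if value != closing:
                raise ValueError(f"unexpected </{value}> while expecting {closing!r}")
            return nodes, i + 1
    if closing is not None:
        raise ValueError(f"missing </{closing}>")
    return nodes, i


def parse(src):
    """Return a list of text strings and (tag, children) tuples."""
    nodes, _ = parse_nodes(tokenize(src), 0)
    return nodes
```

Calling parse("<b>hi <i>there</i></b>") yields a nested structure with one
"b" node containing the text and an "i" node, while parse("<b>x</i>") raises
ValueError because the close tag does not match the open tag.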

Note that (real) HTML is difficult to parse, especially if you want to parse
like IE or Firefox, where tags can be optional, mismatched, or missing.  In
such cases, using an external library might be easier if you are just trying
to get the job done.  But treating HTML as XML ought to simplify matters
quite a bit.
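
As a rough illustration of the "treat it as XML" route, the sketch below
uses Python's standard xml.etree.ElementTree module as a stand-in for
whatever XML library is available to you. The helper name parse_strict and
the sample markup are invented for the example. A strict XML parser accepts
only well-formed input, so the mismatched-tag cases simply fail instead of
needing browser-style recovery.

```python
import xml.etree.ElementTree as ET


def parse_strict(markup):
    """Parse markup as XML; return the root element, or None if it is not well-formed."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None


# Well-formed input: every open tag has a matching close tag.
root = parse_strict("<div><p>hello</p><p>world</p></div>")
texts = [p.text for p in root.iter("p")]

# Tag-soup input: <p> is never closed, so a strict parser rejects it.
bad = parse_strict("<div><p>hello</div>")
```

Here texts collects the text of both paragraphs, and bad is None because the
second input is not well-formed XML.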

Cheers,
yc

Posted on the users mailing list.