<div dir="ltr"><br><div class="gmail_quote">On Fri, Jul 25, 2008 at 2:45 AM, Simon Michael <<a href="mailto:simon@joyful.com">simon@joyful.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="Ih2E3d"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I've hit a new problem: parsing a HTML block, which has matching open and close tags. In regexps it looks like "<([^>]+)>.*?</\\2>". I haven't found any way to mimic the \2 reference with parser-tools. What to do ?<br>
</blockquote>
<br></div>
After more reading, I think parser-tools and similar bottom-up parsers can't do this job - it requires a top-down parser, such as <a href="http://www.lshift.net/blog/2005/08/22/json-for-mzscheme-and-a-portable-packrat-parsing-combinator-library" target="_blank">http://www.lshift.net/blog/2005/08/22/json-for-mzscheme-and-a-portable-packrat-parsing-combinator-library</a> or <a href="http://www.lshift.net/blog/2008/07/01/ometa-for-scheme" target="_blank">http://www.lshift.net/blog/2008/07/01/ometa-for-scheme</a> .<div>
<div></div><div class="Wj3C7c"></div></div></blockquote><div><br>Hi - <br><br>Do you mean that a regular expression alone cannot parse "<([^>]+)>.*?</\\2>"? If so, you are correct: a regular grammar is not recursive, so it cannot match arbitrarily nested open and close tags. HTML is probably more amenable to top-down parsing, but bottom-up should also be doable. I haven't tried it, but here is someone who has: <a href="http://1997.webhistory.org/www.lists/www-html.1995q1/0019.html">http://1997.webhistory.org/www.lists/www-html.1995q1/0019.html</a>. You might be able to extract his grammars and designs. <br>
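<br>To make the top-down idea concrete, here is a minimal sketch of a recursive-descent matcher in which the open tag's name is passed down and compared against the close tag, which is the job the backreference does in the regexp. It is written in Python rather than Scheme purely for illustration, it assumes well-formed input with no attributes, and the names `TAG`, `parse_nodes` and `parse` are my own, not part of parser-tools or any other library.

```python
import re

# Matches <name> or </name>. Group 1 is "/" for a close tag, group 2 is the
# tag name. Attributes are deliberately not handled in this sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def parse_nodes(text, pos=0, open_name=None):
    """Parse text from pos until the close tag for open_name, or end of input.

    Returns (nodes, new_pos). A node is either a str (plain text) or a
    (tag_name, children) tuple.
    """
    nodes = []
    while True:
        m = TAG.search(text, pos)
        if m is None:
            # No more tags: an element that is still open is an error.
            if open_name is not None:
                raise ValueError("missing close tag for " + open_name)
            if pos < len(text):
                nodes.append(text[pos:])
            return nodes, len(text)
        if m.start() > pos:
            nodes.append(text[pos:m.start()])  # plain text before the tag
        name = m.group(2)
        if m.group(1) == "/":
            # Close tag: it must match the name handed down by the caller.
            if name != open_name:
                raise ValueError("unexpected close tag " + name)
            return nodes, m.end()
        # Open tag: recurse, remembering which name has to close it.
        children, pos = parse_nodes(text, m.end(), name)
        nodes.append((name, children))


def parse(text):
    """Parse a whole string and return its list of top-level nodes."""
    nodes, _ = parse_nodes(text)
    return nodes
```

For example, `parse("<b>hi <i>x</i></b>!")` returns `[("b", ["hi ", ("i", ["x"])]), "!"]`, while `parse("<b>x</i>")` raises `ValueError` because the close tag does not match the open one.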
<br>Note that real-world HTML is difficult to parse, especially if you want to parse it the way IE or Firefox do, where tags can be optional, mismatched, or missing. In such cases, using an external library might be easier if you are just trying to get the job done. But treating HTML as XML ought to simplify matters quite a bit.<br>
<br>Cheers,<br>yc<br><br></div></div></div>