[plt-scheme] progress with parser-tools

From: Simon Michael (simon at joyful.com)
Date: Thu Jul 24 18:11:02 EDT 2008

Day 3 with parser-tools, and I understand a bit more. Some issues I've
run into:

- discovered I wasn't seeing all PLT list messages in Thunderbird, so 
fixed that and now I've read parser discussions back to 2002. Yay!

- the test harness I copied was causing me problems, since I didn't 
realize it was designed to keep calling the parser until #f is returned.

- in the lex-abbrevs, lexer and parser I was confused about plain 
strings vs. string regexps vs. sre regexps vs. the lexer's own 
operators. I think I have it all straight now. A small suggestion: 
replace the word "trigger" with the more familiar "pattern" at 
http://docs.plt-scheme.org/parser-tools/Lexers.html ? Also, being able 
to use plain old string regexps where it makes sense would make things 
much easier for newcomers.

- I don't yet have a great understanding of what is best done in the 
lex-abbrevs vs. the lexer vs. the parser grammar. Any rules of thumb here ?

Below is my latest attempt at a parser for
http://daringfireball.net/projects/markdown/syntax . (I know this can be 
done with a hand-written parser or partially with regexps, but I want a 
more declarative technique such as parser-tools. I also found the parser 
combinators collection which looks worth a try, later.)

I've hit a new problem: parsing an HTML block, which has matching open 
and close tags. In regexps it looks like "<([^>]+)>.*?</\\1>". I haven't 
found any way to mimic the \1 reference with parser-tools. What to do ?
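
For illustration, here is a minimal sketch of the backreference idea and of one possible way to imitate it with parser-tools. It is not a confirmed or recommended solution. It assumes that a lexer action may keep reading from the `input-port` binding that parser-tools/lex provides inside actions, so the action can scan ahead for the close tag itself. The names html-block-rx, read-through-close-tag, html-lexer, html-tokens and TEXT are hypothetical, and the opening tag is assumed to be a bare alphabetic name with no attributes.

```scheme
#lang scheme
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

;; The plain-regexp version: \\1 refers back to the first (and only) group.
;; Backreferences need the pregexp (#px) syntax.
(define html-block-rx #px"<([^>]+)>.*?</\\1>")
;; (regexp-match html-block-rx "<div>hi</div> tail")
;; => ("<div>hi</div>" "div")

(define-tokens html-tokens (HTMLBLOCK TEXT))
(define-empty-tokens html-empty-tokens (EOF))

;; Read from the port up to and including "</tag>". Returns the text read,
;; or #f if no close tag exists. Caveat: in the #f case the rest of the
;; input has already been consumed.
(define (read-through-close-tag tag in)
  (let* ([close (string-append "</" tag ">")]
         [skipped (open-output-string)]
         ;; the fifth argument collects the text that precedes the match
         [m (regexp-match (regexp-quote close) in 0 #f skipped)])
    (and m (string-append (get-output-string skipped) close))))

(define html-lexer
  (lexer
   [(eof) 'EOF]
   ;; The pattern only recognizes the open tag; the action does the
   ;; "backreference" by searching for the close tag with the same name.
   [(:: "<" (:+ alphabetic) ">")
    (let* ([tag (substring lexeme 1 (- (string-length lexeme) 1))]
           [rest (read-through-close-tag tag input-port)])
      (if rest
          (token-HTMLBLOCK (string-append lexeme rest))
          (token-TEXT lexeme)))]
   [any-char (token-TEXT lexeme)]))
```

Calling (html-lexer (open-input-string "<div>a <b>x</b></div>rest")) should yield an HTMLBLOCK token whose value is the whole div element. Nested tags of the same name are not handled, since the scan stops at the first matching close tag.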


; markdown parser

(require parser-tools/lex
         parser-tools/yacc
         (prefix-in : parser-tools/lex-sre))

(define-tokens md-tokens (LINE HTMLBLOCK))
(define-empty-tokens empty-tokens (BLANKLINE NEWLINE EOF))

(define-lex-abbrevs
   (newline #\newline)
   (non-newline (:- any-char newline))
   (non-blank-line (:+ non-newline))
;  (html-block ) ; hmm. how to match open/close tag ?
   )

(define md-lexer
  (lexer
    [(eof)  'EOF]
    ["\n\n" 'BLANKLINE]
    ["\n"   'NEWLINE]
    [non-blank-line (token-LINE lexeme)]
;   [html-block (token-HTMLBLOCK lexeme)]
    ))

(define md-parser
  (parser
    ;(debug "parse.debug")
;    (precs ; left, right, nonassoc
;     (left BLANKLINE)
;     )
    (error (lambda (ok? name value) (printf "could not parse: ~a\n" name)))
    (tokens md-tokens empty-tokens)
    (start document)
    (end EOF)
    (grammar
     (document   [(blocks) (reverse $1)]
                 [() null])
     (blocks [(blocks block) (cons $2 $1)]
             [(block) (list $1)])
     (block [(paragraph) $1]
;           [(htmlblock) $1]
            )
     (paragraph [(lines BLANKLINE) (reverse $1)]
                [(lines) (reverse $1)])
;;     (htmlblock [(HTMLBLOCK) 'htmlblock]
;;                )
     (lines [(lines line) (cons $2 $1)]
            [(line) (list $1)])
     (line  [(LINE NEWLINE) $1]
            [(LINE) $1]))))

; utils

(define (string->lex-list s lexer) (port->lex-list (open-input-string s) 
lexer))
(define (lex-string s) (string->lex-list s md-lexer))
(define (parse p lexer parser) (parser (lambda () (lexer p))))
(define (parse-string s) (parse (open-input-string s) md-lexer md-parser))

Posted on the users mailing list.