[plt-scheme] progress with parser-tools
Day 3 with parser-tools, and I understand a bit more. Some issues I've
encountered:
- discovered I wasn't seeing all PLT list messages in Thunderbird, so
fixed that and now I've read the parser discussions back to 2002. Yay!
- the test harness I copied was causing me problems, since I didn't
realize it was designed to keep calling the parser until #f is returned.
- in the lex-abbrevs, lexer and parser I was confused about plain
strings vs. string regexps vs. sre regexps vs. the lexer's own
operators. I think I have it all straight now. A small suggestion:
replace the word "trigger" with the more familiar "pattern" at
http://docs.plt-scheme.org/parser-tools/Lexers.html? Also, being able
to use plain old string regexps where it makes sense would make things
much easier for newcomers.
- I don't yet have a good understanding of what is best done in the
lex-abbrevs vs. the lexer vs. the parser grammar. Any rules of thumb here?
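To illustrate the harness behaviour mentioned above (calling the parser repeatedly until it returns #f), here is a minimal sketch. It is not the actual harness; parse-all and parse-one are hypothetical names, and parse-one stands for any thunk that returns one parse result per call and #f when there is nothing left.

```scheme
#lang scheme
;; Sketch only: collect results from repeated calls until #f comes back.
;; parse-one is a hypothetical thunk, e.g. (lambda () (some-parser get-token)).
(define (parse-all parse-one)
  (let loop ([results '()])
    (let ([r (parse-one)])
      (if r
          (loop (cons r results))
          (reverse results)))))
```

With a harness of this shape, a parser that never returns #f would simply be called again and again.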
Below is my latest attempt at a parser for Markdown
(http://daringfireball.net/projects/markdown/syntax). (I know this can be
done with a hand-written parser or partially with regexps, but I want a
more declarative technique such as parser-tools. I also found the parser
combinators collection which looks worth a try, later.)
I've hit a new problem: parsing an HTML block, which has matching open
and close tags. In regexps it looks like "<([^>]+)>.*?</\\1>". I haven't
found any way to mimic the \1 backreference with parser-tools. What to do?
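For concreteness, the string-regexp version can be sketched with PLT's pregexp syntax. Since the pattern has a single capture group, the backreference is written \\1 inside the string. The name html-block-rx and the sample inputs are made up for illustration.

```scheme
#lang scheme
;; Sketch only: match an HTML block whose closing tag repeats the opening tag.
;; \\1 refers back to whatever the first (and only) capture group matched.
(define html-block-rx (pregexp "<([^>]+)>.*?</\\1>"))

;; Matching tags: the whole block matches and group 1 holds the tag name.
(regexp-match html-block-rx "<div>some text</div>")

;; Mismatched tags: no match, so the result is #f.
(regexp-match html-block-rx "<div>some text</span>")
```

This is the behaviour I'd like to reproduce in the lexer or the grammar.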
Thanks,
-Simon
; markdown parser
(require parser-tools/yacc
         parser-tools/lex
         (prefix-in : parser-tools/lex-sre))
(define-tokens md-tokens (LINE HTMLBLOCK))
(define-empty-tokens empty-tokens (BLANKLINE NEWLINE EOF))
(define-lex-abbrevs
  (newline #\newline)
  (non-newline (:- any-char newline))
  (non-blank-line (:+ non-newline))
  ; (html-block ) ; hmm. how to match open/close tag ?
  )
(define md-lexer
  (lexer
   [(eof) 'EOF]
   ["\n\n" 'BLANKLINE]
   ["\n" 'NEWLINE]
   [non-blank-line (token-LINE lexeme)]
   ; [html-block (token-HTMLBLOCK lexeme)]
   ))
(define md-parser
  (parser
   ;(debug "parse.debug")
   ; (precs ; left, right, nonassoc
   ;  (left BLANKLINE)
   ;  )
   (suppress)
   (error (lambda (ok? name value) (printf "could not parse: ~a\n" name)))
   (tokens md-tokens empty-tokens)
   (start document)
   (end EOF)
   (grammar
    (document [(blocks) (reverse $1)]
              [() null])
    (blocks [(blocks block) (cons $2 $1)]
            [(block) (list $1)])
    (block [(paragraph) $1]
           ; [(htmlblock) $1]
           )
    (paragraph [(lines BLANKLINE) (reverse $1)]
               [(lines) (reverse $1)])
    ;; (htmlblock [(HTMLBLOCK) 'htmlblock]
    ;;  )
    (lines [(lines line) (cons $2 $1)]
           [(line) (list $1)])
    (line [(LINE NEWLINE) $1]
          [(LINE) $1]))))
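; utils
; port->lex-list is not defined in this post; assumed helper: collect tokens up to and including EOF
(define (port->lex-list port lexer)
  (let loop ([tokens '()])
    (let ([tok (lexer port)])
      (if (eq? tok 'EOF) (reverse (cons tok tokens)) (loop (cons tok tokens))))))
(define (string->lex-list s lexer)
  (port->lex-list (open-input-string s) lexer))
(define (lex-string s) (string->lex-list s md-lexer))
(define (parse p lexer parser) (parser (lambda () (lexer p))))
(define (parse-string s) (parse (open-input-string s) md-lexer md-parser))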