[racket] Look-ahead in parser-tools?

From: Simon Haines (simon.haines at con-amalgamate.net)
Date: Mon Jan 16 20:35:51 EST 2012

I've been playing around with parser-tools and am having difficulty
expressing the following language:

"remember <alias> is <email>"
"remember <fact>"

where <alias> is any string that does not contain the word 'is', <email> is
a well-formed email address and <fact> is any string that does not match
the previous constraints.

Here's a (stripped-down) version of what I have so far:
#lang racket

(require parser-tools/lex
         parser-tools/yacc
         (prefix-in : parser-tools/lex-sre))

(define-lex-abbrevs
  (atext (:+ (:or alphabetic (:/ #\0 #\9) (char-set
  (dot-atom (:: atext (:* #\. atext))))

(define-tokens toy-tokens (addr-spec alias fact))
(define-empty-tokens empty-toy-tokens (eof REMEMBER IS))

(define toy-lexer
  (lexer-src-pos
   ; Consume whitespace
   ((:or #\tab #\space) (return-without-pos (toy-lexer input-port)))

   ; Email addresses
   ((:: dot-atom #\@ dot-atom) (token-addr-spec lexeme))

   ; Commands
   ("remember" 'REMEMBER)
   ("is" 'IS)

   ; ??? what to lex here ???
   ((complement (:: any-string "is" any-string)) (token-alias lexeme))
   (any-string (token-fact lexeme))))

(define toy-parser
  (parser
   (src-pos)
   (tokens toy-tokens empty-toy-tokens)
   (start start)
   (end eof)
   (error (lambda (a b c d e) (display (format "~a ~a ~a ~a ~a" a b c
                                               (position-offset d)
                                               (position-offset e)))))
   (grammar
    (start (() #f)
           ((REMEMBER alias IS addr-spec) `(alias ,$2 ,$4))
           ((REMEMBER fact) `(fact ,$2))))))

; test
(define (test str)
  (let ((p (open-input-string str)))
    (port-count-lines! p)
    (toy-parser (lambda () (toy-lexer p)))))

The problem I'm having is that the 'fact' lexer rule always matches, without
giving the other rules a chance to attempt a match. Perhaps this is down to
my ignorance of BNF. Can this language be expressed in this way? One
alternative I've thought of is to create a lexer rule that just matches
"remember", then pass the port to another lexer that looks for "is"
or (eof) and munges the result into a token. Another is to
regex the <alias>, <email> or <fact> clauses out and parse them separately,
but I'd like to compose this toy parser into a larger one if possible. Still,
I feel there is a simple technique here that I've missed.
Any ideas?
Many thanks, Simon.
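
For concreteness, the regex fallback mentioned above could be sketched
roughly as follows, while still feeding the result through a yacc-style
parser so it can be composed into a larger one. This is only a sketch under
assumptions: the names tokenize, parse-string, parse-error, email-rx and the
nonterminal command are hypothetical, and email-rx is a crude placeholder
rather than a real well-formed-address check.

```racket
#lang racket
(require parser-tools/lex
         parser-tools/yacc)

(define-tokens toy-tokens (addr-spec alias fact))
(define-empty-tokens empty-toy-tokens (eof REMEMBER IS))

;; Assumption: a deliberately simplistic stand-in for "well-formed email".
(define email-rx #px"^[^\\s@]+@[^\\s@]+$")

;; Hypothetical helper: classify the text after "remember" with regexps
;; and return the complete token list for one input string.
(define (tokenize str)
  (match (regexp-match #px"^\\s*remember\\s+(.*?)\\s*$" str)
    [(list _ body)
     (match (regexp-match #px"^(.*?)\\s+is\\s+(\\S+)$" body)
       [(list _ alias addr)
        ;; Only an alias if the address looks like an email and the
        ;; alias itself does not contain the word "is".
        #:when (and (regexp-match? email-rx addr)
                    (not (regexp-match? #px"\\bis\\b" alias)))
        (list (token-REMEMBER) (token-alias alias) (token-IS)
              (token-addr-spec addr) (token-eof))]
       ;; Anything else after "remember" is treated as a fact.
       [_ (list (token-REMEMBER) (token-fact body) (token-eof))])]
    [_ (list (token-eof))]))

;; Hypothetical error handler (three arguments, since no src-pos is used).
(define (parse-error tok-ok? name value)
  (error 'toy-parser "unexpected token ~a ~a" name value))

(define toy-parser
  (parser
   (tokens toy-tokens empty-toy-tokens)
   (start command)
   (end eof)
   (error parse-error)
   (grammar
    (command (() #f)
             ((REMEMBER alias IS addr-spec) `(alias ,$2 ,$4))
             ((REMEMBER fact) `(fact ,$2))))))

;; Hand the pre-computed tokens to the parser one at a time.
(define (parse-string str)
  (define remaining (tokenize str))
  (define (next-token)
    (define tok (first remaining))
    ;; Keep returning the final eof token if asked again.
    (unless (null? (rest remaining))
      (set! remaining (rest remaining)))
    tok)
  (toy-parser next-token))
```

With these definitions, (parse-string "remember bob is bob@example.com")
should yield '(alias "bob" "bob@example.com"), while
(parse-string "remember the sky is blue") falls through to
'(fact "the sky is blue") because "blue" is not an address. The trade-off is
that the alias/fact decision is made outside the lexer, so the grammar only
ever sees unambiguous tokens.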

Posted on the users mailing list.