[racket] Look-ahead in parser-tools?
Hello Simon,
Some observations that may help:
1. Since the problem is obviously in the lexer,
you would probably be better off testing the lexer directly
instead of the parser:
(define (test-lexer str)
  (let ((p (open-input-string str)))
    (port-count-lines! p)
    (let loop ()
      (let ((tok (position-token-token (toy-lexer p))))
        (printf "~a\n" tok)
        (unless (equal? tok 'eof)
          (loop))))))
2. You probably do not want the aliases to contain
whitespace:
(define-lex-abbrevs
  [...]
  (lex:whitespace (:or #\newline #\return #\tab #\space #\vtab)))
(define toy-lexer
  (lexer-src-pos
   [...]
   ((:&
     (:* (char-complement lex:whitespace))
     (complement (:: any-string "is" any-string)))
    (token-alias lexeme))
   [...]))
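To see how this alias rule behaves on its own, here is a small
self-contained sketch. It is a cut-down lexer written under my own
assumptions: demo-lexer and lex-all are hypothetical names, source
positions and the email rule are left out for brevity, and it uses
:+ rather than :* so the alias rule cannot match the empty string.

```racket
#lang racket
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

(define-lex-abbrevs
  (lex:whitespace (:or #\newline #\return #\tab #\space #\vtab)))

(define-tokens toy-tokens (alias))
(define-empty-tokens empty-toy-tokens (eof REMEMBER IS))

;; Hypothetical cut-down lexer: keywords plus the whitespace-free alias rule.
(define demo-lexer
  (lexer
   ((eof) (token-eof))
   ;; Skip whitespace by calling the lexer again on the same port.
   ((:+ lex:whitespace) (demo-lexer input-port))
   ;; Keywords come first so they win ties against the alias rule.
   ("remember" (token-REMEMBER))
   ("is" (token-IS))
   ;; A run of non-whitespace characters that does not contain "is".
   ((:& (:+ (char-complement lex:whitespace))
        (complement (:: any-string "is" any-string)))
    (token-alias lexeme))))

;; Hypothetical helper: lex a whole string into a list of (name value) pairs.
(define (lex-all str)
  (define p (open-input-string str))
  (let loop ((acc '()))
    (define tok (demo-lexer p))
    (if (eq? (token-name tok) 'eof)
        (reverse acc)
        (loop (cons (list (token-name tok) (token-value tok)) acc)))))

(lex-all "remember somefact")
```

With this sketch the word after "remember" comes back as an alias
token, because nothing at that point tells the lexer whether an "is"
will follow.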
3. I think the problem with the lexer is that it cannot tell
a "fact" from an "alias", because it cannot look ahead to see
whether an "is" follows.
For example, if the input string is "remember somefact",
it reads "remember", skips the whitespace, and then
lexes "somefact" as an alias. To work around that, I would
change the lexer rules to require quotes around
facts, or something along those lines.
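As a rough sketch of the quoting idea: assume facts are written in
double quotes and aliases are single whitespace-free words (both are
my assumptions, the original grammar does not specify either). Then
the lexer and parser might look like the following. parse-string is a
hypothetical helper, and source positions are omitted to keep it short.

```racket
#lang racket
(require parser-tools/lex
         parser-tools/yacc
         (prefix-in : parser-tools/lex-sre))

(define-lex-abbrevs
  (atext (:+ (:or alphabetic (:/ #\0 #\9)
                  (char-set "!#$%&'*+-/=?^_`{|}~"))))
  (dot-atom (:: atext (:* #\. atext)))
  (lex:whitespace (:or #\newline #\return #\tab #\space #\vtab)))

(define-tokens toy-tokens (addr-spec alias fact))
(define-empty-tokens empty-toy-tokens (eof REMEMBER IS))

(define toy-lexer
  (lexer
   ((eof) (token-eof))
   ((:+ lex:whitespace) (toy-lexer input-port))
   ;; Keywords are listed before the alias rule so they win ties.
   ("remember" (token-REMEMBER))
   ("is" (token-IS))
   ;; Assumption: a fact is anything between double quotes.
   ;; The surrounding quotes are stripped from the token value.
   ((:: #\" (:* (char-complement #\")) #\")
    (token-fact (substring lexeme 1 (sub1 (string-length lexeme)))))
   ;; Email addresses, listed before the alias rule for the same reason.
   ((:: dot-atom #\@ dot-atom) (token-addr-spec lexeme))
   ;; Assumption: an alias is a single word without whitespace.
   ((:+ (char-complement lex:whitespace)) (token-alias lexeme))))

(define toy-parser
  (parser
   (tokens toy-tokens empty-toy-tokens)
   (start start)
   (end eof)
   (error (lambda (tok-ok? tok-name tok-value)
            (error 'toy-parser "unexpected token: ~a ~a" tok-name tok-value)))
   (grammar
    (start (() #f)
           ((REMEMBER alias IS addr-spec) `(alias ,$2 ,$4))
           ((REMEMBER fact) `(fact ,$2))))))

;; Hypothetical helper: parse one command from a string.
(define (parse-string str)
  (define p (open-input-string str))
  (toy-parser (lambda () (toy-lexer p))))

(parse-string "remember bob is bob@example.com")
(parse-string "remember \"the sky is blue\"")
```

Because the opening quote decides the token type right away, the
lexer no longer needs to look ahead for "is", and a fact may contain
the word "is" freely.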
Best regards,
Dmitry
On 01/17/2012 05:35 AM, Simon Haines wrote:
> I've been playing around with parser-tools and am having difficulty
> expressing the following language:
>
> "remember <alias> is <email>"
> "remember <fact>"
>
> where <alias> is any string that does not contain the word 'is', <email>
> is a well-formed email address and <fact> is any string that does not
> match the previous constraints.
>
> Here's a (stripped down) version of what I have so far:
> #lang racket
>
> (require parser-tools/lex
>          parser-tools/yacc
>          (prefix-in : parser-tools/lex-sre))
>
> (define-lex-abbrevs
>   (atext (:+ (:or alphabetic (:/ #\0 #\9)
>                   (char-set "!#$%&'*+-/=?^_`{|}~"))))
>   (dot-atom (:: atext (:* #\. atext))))
>
> (define-tokens toy-tokens (addr-spec alias fact))
> (define-empty-tokens empty-toy-tokens (eof REMEMBER IS))
>
> (define toy-lexer
>   (lexer-src-pos
>    ; Consume whitespace
>    ((:or #\tab #\space) (return-without-pos (toy-lexer input-port)))
>    ; Email addresses
>    ((:: dot-atom #\@ dot-atom) (token-addr-spec lexeme))
>
>    ; Commands
>    ("remember" 'REMEMBER)
>    ("is" 'IS)
>    ; ??? what to lex here ???
>    ((complement (:: any-string "is" any-string)) (token-alias lexeme))
>    (any-string (token-fact lexeme))))
>
> (define toy-parser
>   (parser
>    (tokens toy-tokens empty-toy-tokens)
>    (start start)
>    (end eof)
>    (error (lambda (a b c d e)
>             (display (format "~a ~a ~a ~a ~a" a b c
>                              (position-offset d)
>                              (position-offset e)))))
>    (src-pos)
>    (grammar
>     (start (() #f)
>            ((REMEMBER alias IS addr-spec) `(alias ,$2 ,$4))
>            ((REMEMBER fact) `(fact ,$2))))))
>
> ; test
> (define (test str)
>   (let ((p (open-input-string str)))
>     (port-count-lines! p)
>     (toy-parser (lambda () (toy-lexer p)))))
>
> The problem I'm having is that the 'fact' lexer rule always matches
> without giving a chance for the other rules to attempt a match. Perhaps
> it is my ignorance with BNF. Can this language be expressed in this way?
> An alternative I've thought of is to create a lexer rule to just match
> "remember" then pass the port to another lexer that tries to look for
> "is" or (eof) and munge the result into a token. Alternatively I could
> try to regex the <alias>, <email> or <fact> clauses out and parse them
> separately, but I'd like to compose this toy parser into a larger one if
> possible. Yet I feel there is a simple technique here that I've missed
> in my ignorance. Any ideas?
> Many thanks, Simon.
>
> ____________________
> Racket Users list:
> http://lists.racket-lang.org/users