[racket] Question about parser-tools/lex
Hi, Philippe.
In your third case I actually would expect your lexer to return the
same thing as in your second case.
(collect-tokens "4 a") => (list (token-NUM 4) (token-ID 'a))
(collect-tokens "4a") => (list (token-NUM 4) (token-ID 'a))
Because you're having the lexer consume/discard the whitespace.
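For context, here's roughly what I imagine your sample-lexer looks
like (a sketch only, with token names guessed from your tests):

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
#lang racket
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

(define-tokens value-tokens (NUM ID))

(define sample-lexer
  (lexer
   ;; each call matches the longest prefix it can (maximal munch)
   [(:+ numeric)    (token-NUM (string->number lexeme))]
   [(:+ alphabetic) (token-ID (string->symbol lexeme))]
   ;; whitespace is consumed and the lexer recurses, so no token
   ;; ever comes out for it
   [(:+ whitespace) (sample-lexer input-port)]
   [(eof)           'EOF]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

With rules like these, "4a" tokenizes happily: one call consumes "4"
as a NUM, the next consumes "a" as an ID, and no rule ever demands a
separator between them.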
If you changed the lexer to return whitespace as a token, then I would
expect something like (say):
(collect-tokens "4a") => (list (token-NUM 4) WS (token-ID 'a))
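Concretely, that change might look like this (reusing the definitions
from the sketch above):

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(define-empty-tokens space-tokens (WS))

(define ws-lexer
  (lexer
   [(:+ numeric)    (token-NUM (string->number lexeme))]
   [(:+ alphabetic) (token-ID (string->symbol lexeme))]
   ;; emit a token for the whitespace instead of recursing past it
   [(:+ whitespace) (token-WS)]
   [(eof)           'EOF]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;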
And then I'd expect you to handle that issue in your parser, not in
the lexer, because in general I think you want the lexer to focus on
making tokens, and the parser to focus on their order, grouping, and
meaning.
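For example, here's an untested sketch of a parser-tools/yacc grammar
that insists on the WS between a number and an identifier (the EOF
token and the grammar shape are my inventions, just to show the idea):

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(require parser-tools/yacc)

(define-empty-tokens end-tokens (EOF))

(define parse
  (parser
   (tokens value-tokens space-tokens end-tokens)
   (start expr)
   (end EOF)
   (error (lambda (tok-ok? tok-name tok-value)
            (error 'parse "unexpected token: ~a" tok-name)))
   (grammar
    ;; "4 a" fits this rule; "4a" yields no WS token, so it errors
    (expr [(NUM WS ID) (list $1 $3)]))))

;; e.g.:
;; (call-with-input-string "4 a"
;;   (lambda (ip) (parse (lambda () (ws-lexer ip)))))
;; => '(4 a)
;; the same call on "4a" raises a parse error
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;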
(However please take my suggestion with a huge grain of salt, because
the first time I ever used _any_ lexer and parser, was using Racket's
about a month ago. :) This list has many people who know much
better.)
Best regards,
Greg
On Thu, Oct 18, 2012 at 2:57 PM, Philippe Mechaï
<philippe.mechai at gmail.com> wrote:
> On Thu, Oct 18, 2012 at 12:05:19PM -0600, Danny Yoo wrote:
>> > The first two tests behave as expected but I would have expected the third
>> > to fail. I understand that it does not, given the way the lexer is
>> > implemented, but I cannot figure out how to change this behavior.
>>
>> Can you make those expectations explicit? I do not know what you want
>> the value or behavior to be from the test expressions below.
>>
>>
>> You can use the rackunit test suite library
>> (http://docs.racket-lang.org/rackunit/) to make these expectations
>> more explicit.
>>
>>
>> For example, let's say that we add a require to rackunit as well as a
>> small helper function to pull all the tokens out of a string:
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>> (require rackunit)
>>
>> ;; collect-tokens: string -> (listof token)
>> ;; Grabs all the tokens we can out of the tokenization of instr.
>> (define (collect-tokens instr)
>>   (call-with-input-string
>>    instr
>>    (lambda (ip)
>>      (define producer (lambda () (sample-lexer ip)))
>>      (for/list ([token (in-producer producer 'EOF)])
>>        token))))
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>
>>
>> With this setup, we can express tests such as:
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>> ;; TEST 1
>> (check-exn exn:fail? (lambda () (collect-tokens "*")))
>>
>> ;; TEST 2
>> (check-equal? (collect-tokens "4 a") (list (token-NUM 4) (token-ID 'a)))
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>
>>
>> These assert what we want the behavior to be in a way that allows for
>> automatic testing. If we break the behavior of the tokenizer, these
>> tests will yell at us. Otherwise, they'll stay silent.
>>
>>
>> What do you want the behavior of the lexer to be when it hits
>> "4a"? I'm being serious when I say I do not know! It could either be
>> an error, or maybe you want an ID with '4a as its content. Or maybe
>> you want two separate tokens. Which do you want?
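>>
>> Each of those choices is just a different lexer rule, by the way.
>> Roughly, and untested:
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>> ;; to make "4a" an error: this match is longer than NUM's "4", so
>> ;; it wins under maximal munch
>> [(:: (:+ numeric) (:+ alphabetic))
>>  (error 'sample-lexer "number runs into identifier: ~a" lexeme)]
>>
>> ;; or to make "4a" a single ID; plain "4" still lexes as a NUM if
>> ;; the NUM rule is listed first, since earlier rules break ties
>> [(:+ (:or numeric alphabetic)) (token-ID (string->symbol lexeme))]
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;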
>
> Hi Danny,
>
> First of all, thank you for your very complete answer.
>
> I am sorry I was not clear enough; things seemed obvious to me given the attached sample lexer.
> Note that the lexer I actually wrote is more complicated. I found that it didn't behave properly when I started writing unit tests, so I wrote this minimal example to exhibit the behavior that seems strange to me.
>
> Anyway, back to your questions and using your testing code, this is the behavior I expect:
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
> ;; Test 1
> (check-exn exn:fail? (lambda () (collect-tokens "*")))
>
> ;; Test 2
> (check-equal? (collect-tokens "4 a") (list (token-NUM 4) (token-ID 'a)))
>
> ;; Test 3
> (check-exn exn:fail? (lambda () (collect-tokens "4a")))
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> I thought that, given the way the NUM and ID tokens are defined (resp. only digits, only letters), the third test would pass... but it does not.
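>
> For reference, the relevant rules in the attached file look something like this:
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
> [(:+ numeric)    (token-NUM (string->number lexeme))]
> [(:+ alphabetic) (token-ID (string->symbol lexeme))]
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> Since neither pattern matches "4a" as a whole, I assumed the lexer would reject it.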
>
> Thanks again for your time.
>
> Regards,
> Philippe Mechaï