[racket] Question about parser-tools/lex

From: Greg Hendershott (greghendershott at gmail.com)
Date: Thu Oct 18 15:23:49 EDT 2012

Hi, Philippe.

In your third case I actually would expect your lexer to return the
same thing as in your second case.

(collect-tokens "4 a") => (list (token-NUM 4) (token-ID 'a))
(collect-tokens "4a")  => (list (token-NUM 4) (token-ID 'a))

Because you're having the lexer consume/discard the whitespace.
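
For concreteness, here is roughly the kind of lexer I'm picturing.
This is only a sketch I made up, not your actual sample-lexer: the
whitespace rule just calls the lexer again, so no token ever comes
out of it, and the lexer takes the longest match it can at each call,
so it happily returns a NUM for the "4" and then an ID for the "a".

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

(define-tokens value-tokens (NUM ID))
(define-empty-tokens op-tokens (EOF))

(define sketch-lexer
  (lexer
   [(eof) (token-EOF)]
   ;; consume the whitespace and keep lexing: no token is produced
   [(:+ whitespace) (sketch-lexer input-port)]
   [(:+ numeric)    (token-NUM (string->number lexeme))]
   [(:+ alphabetic) (token-ID  (string->symbol lexeme))]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;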

If you changed the lexer to return whitespace as a token, then I would
expect something like (say):

(collect-tokens "4a")  => (list (token-NUM 4) WS (token-ID 'a))
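
Concretely, that would just mean adding WS to the empty-token group
and having the whitespace rule build a token instead of recursing
(again only a sketch, building on the one above):

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(define-empty-tokens op-tokens (WS EOF))

(define ws-lexer
  (lexer
   [(eof) (token-EOF)]
   ;; produce a WS token instead of silently skipping the whitespace
   [(:+ whitespace) (token-WS)]
   [(:+ numeric)    (token-NUM (string->number lexeme))]
   [(:+ alphabetic) (token-ID  (string->symbol lexeme))]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;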

And then I'd expect you to handle that issue in your parser, not in
the lexer. In general, I think you want the lexer to focus on making
tokens, and the parser to focus on their order, grouping, and
meaning.
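
For example, the parser could be the thing that insists on a WS
between two items, so that "4 a" parses and "4a" does not. This is a
very rough, untested sketch with parser-tools/yacc; the grammar is
something I invented, not taken from your code:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(require parser-tools/yacc)

(define ws-parser
  (parser
   (start items)
   (end EOF)
   (tokens value-tokens op-tokens)
   (error (lambda (tok-ok? tok-name tok-value)
            (error 'parse "unexpected token: ~a" tok-name)))
   (grammar
    ;; items must be separated by explicit WS tokens
    (items [(item)          (list $1)]
           [(item WS items) (cons $1 $3)])
    (item  [(NUM) $1]
           [(ID)  $1]))))

;; "4 a" gives NUM WS ID and parses; "4a" gives NUM ID and errors:
;; (ws-parser (lambda () (ws-lexer (open-input-string "4a"))))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;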

(However, please take my suggestion with a huge grain of salt: the
first time I ever used _any_ lexer and parser was with Racket's,
about a month ago. :)  This list has many people who know much
better.)

Best regards,
Greg

On Thu, Oct 18, 2012 at 2:57 PM, Philippe Mechaï
<philippe.mechai at gmail.com> wrote:
> On Thu, Oct 18, 2012 at 12:05:19PM -0600, Danny Yoo wrote:
>> > The first two tests behave as expected but I would have expected the third
>> > to fail. I understand that it does not, given the way the lexer is
>> > implemented, but I cannot figure out how to change this behavior.
>>
>> Can you make those expectations explicit?  I do not know what you want
>> the value or behavior to be from the test expressions below.
>>
>>
>> You can use the rackunit test suite library
>> (http://docs.racket-lang.org/rackunit/) to make these expectations
>> more explicit.
>>
>>
>> For example, let's say that we add a require to rackunit as well as a
>> small helper function to pull all the tokens out of a string:
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>> (require rackunit)
>>
>> ;; collect-tokens: string -> (listof token)
>> ;; Grabs all the tokens we can out of the tokenization of instr.
>> (define (collect-tokens instr)
>>   (call-with-input-string
>>    instr
>>    (lambda (ip)
>>      (define producer (lambda () (sample-lexer ip)))
>>      (for/list ([token (in-producer producer 'EOF)])
>>        token))))
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>
>>
>> With this setup, we can express tests such as:
>>
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>> ;; TEST 1
>> (check-exn exn:fail? (lambda () (collect-tokens "*")))
>>
>> ;; TEST 2
>> (check-equal? (collect-tokens "4 a") (list (token-NUM 4) (token-ID 'a)))
>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>
>>
>> These assert what we want the behavior to be in a way that allows for
>> automatic testing.  If we break the behavior of the tokenizer, these
>> tests will yell at us.  Otherwise, they'll stay silent.
>>
>>
>> What do you want the behavior of the lexer to be when it hits
>> "4a"?  I'm being serious when I say I do not know!  It could either be
>> an error, or maybe you want an ID with '4a as its content.  Or maybe
>> you want two separate tokens.  Which do you want?
>
> Hi Danny,
>
> First of all, thank you for your very complete answer.
>
> I am sorry I was not clear enough; things seemed obvious to me given the attached sample lexer.
> Note that the lexer I actually wrote is more complicated. I found that it did not behave as expected when I started writing unit tests, so I wrote this minimal example to exhibit the behavior that seems strange to me.
>
> Anyway, back to your questions and using your testing code, this is the behavior I expect:
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
> ;; Test 1
> (check-exn exn:fail? (lambda () (collect-tokens "*")))
>
> ;; Test 2
> (check-equal? (collect-tokens "4 a") (list (token-NUM 4) (token-ID 'a)))
>
> ;; Test 3
> (check-exn exn:fail? (lambda () (collect-tokens "4a")))
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> I thought that, given the way the NUM and ID tokens are defined (only digits and only letters, respectively), the third test should pass... it does not.
>
> Thanks again for your time.
>
> Regards,
> Philippe Mechaï
> ____________________
>   Racket Users list:
>   http://lists.racket-lang.org/users

