[racket] Regex's and utf-8

From: Harry Spier (vasishtha.spier at gmail.com)
Date: Fri Jul 27 12:30:33 EDT 2012

Previous message: [racket] Sintrapara styling?
Next message: [racket] Formating Printing of Real Number

Would it be possible (or would it be a good idea) for character
regexes to have a mode option, "strict" or "not-strict", that would
throw an error in strict mode if the input byte stream contained
bytes that are not valid UTF-8?

One possible use is this.
It's really easy to accidentally apply a character regex to a byte string
(when you meant to apply a byte-string regex to that byte string), run
test cases, and think it's working OK.
I.e. to write:
(regexp-match-positions* #rx"[^ÿ]+" #"...input byte string...")
when you meant
(regexp-match-positions* #rx#"[^ÿ]+" #"...input byte string...")

For example, this appears to work:
> (integer->char 255)
#\ÿ
> (regexp-match-positions* #rx"[^ÿ]+" #"abcÿabc")
'((0 . 3) (4 . 7))

BUT these give the same result, even though neither pattern mentions ÿ:
> (regexp-match-positions* #rx"[^k]+" #"abcÿabc")
'((0 . 3) (4 . 7))
> (regexp-match-positions* #rx".+" #"abcÿabc")
'((0 . 3) (4 . 7))
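For comparison, here is a small sketch that runs the character regex and the byte regex over the same input. The input is written with the octal escape \377 for byte 255, and the variable names are only for illustration.

```racket
#lang racket/base

;; Same 7-byte input as above; \377 is byte 255 written as an octal escape.
(define input #"abc\377abc")

;; Character regex on a byte string: byte 255 is left out of every match.
(define char-result (regexp-match-positions* #rx"[^k]+" input))

;; Byte regex with the same pattern: byte 255 is just another non-k byte.
(define byte-result (regexp-match-positions* #rx#"[^k]+" input))

;; Byte regex that really does exclude byte 255.
(define split-result (regexp-match-positions* #rx#"[^\377]+" input))

(printf "~s\n~s\n~s\n" char-result byte-result split-result)
;; prints:
;; ((0 . 3) (4 . 7))
;; ((0 . 7))
;; ((0 . 3) (4 . 7))
```

Only the byte regex treats byte 255 as ordinary data, so only there does the pattern decide where the matches split.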
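In each case byte 255 is silently skipped, presumably because it is not
valid UTF-8 on its own and so a character regex can never match it.
A "strict" mode would expose this mistake.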
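Until something like that exists, a user-level approximation might look like the sketch below. The name strict-regexp-match-positions* is hypothetical (it is not part of Racket), and it assumes that "strict" simply means "refuse a byte-string input that is not valid UTF-8 when the pattern is a compiled character regex". It relies on bytes-utf-8-length returning #f for an invalid encoding.

```racket
#lang racket/base

;; Hypothetical helper, not a Racket built-in: behaves like
;; regexp-match-positions*, but raises an error when a character regex
;; is applied to a byte string that is not a valid UTF-8 encoding.
(define (strict-regexp-match-positions* pattern input)
  (when (and (regexp? pattern)                 ; #t for #rx"...", #f for #rx#"..."
             (bytes? input)
             (not (bytes-utf-8-length input))) ; #f means invalid UTF-8
    (raise-argument-error 'strict-regexp-match-positions*
                          "byte string containing valid UTF-8"
                          input))
  (regexp-match-positions* pattern input))

;; Valid UTF-8 input: same result as the plain function.
(strict-regexp-match-positions* #rx"[^k]+" #"abcabc")

;; A byte regex is never checked, so byte 255 is fine here.
(strict-regexp-match-positions* #rx#"[^\377]+" #"abc\377abc")

;; This call would raise exn:fail:contract instead of skipping byte 255:
;; (strict-regexp-match-positions* #rx"[^k]+" #"abc\377abc")
```

The check walks the whole byte string before matching, so it costs an extra pass over the input; a built-in mode could presumably do the same check during matching.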

Thanks,
Harry Spier

Posted on the users mailing list.
