[racket] Regex's and utf-8

From: Harry Spier (vasishtha.spier at gmail.com)
Date: Fri Jul 27 12:30:33 EDT 2012

Would it be possible (or would it be a good idea) for character
regexes to have a mode option, "strict" or "not-strict", that would
throw an error, when in strict mode, if the input stream contained
bytes that are not valid UTF-8?

One possible use is this.
It's really easy to accidentally apply a character regex to a byte string
(when you meant to apply a byte-string regex to that byte string), run
test cases, and think it's working OK.
I.e. to write:
(regexp-match-positions* #rx"[^ÿ]+" #"...input byte string...")
when you meant
(regexp-match-positions* #rx#"[^ÿ]+" #"...input byte string...")
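
The two literals differ only by one "#", so it may help to check which kind of pattern a literal actually produces. A minimal sketch, assuming only the standard regexp? and byte-regexp? predicates from racket/base (the names char-rx and byte-rx are made up for illustration):

```racket
#lang racket/base

;; #rx"..." reads as a character (string) regexp;
;; #rx#"..." reads as a byte regexp.
(define char-rx #rx"[^k]+")
(define byte-rx #rx#"[^k]+")

(regexp? char-rx)       ; #t
(byte-regexp? char-rx)  ; #f
(regexp? byte-rx)       ; #f
(byte-regexp? byte-rx)  ; #t
```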

For example this appears to work:
> (integer->char 255)
#\ÿ
> (regexp-match-positions* #rx"[^ÿ]+" #"abcÿabc")
'((0 . 3) (4 . 7))

BUT these give the same result:
> (regexp-match-positions* #rx"[^k]+" #"abcÿabc")
'((0 . 3) (4 . 7))
> (regexp-match-positions* #rx".+" #"abcÿabc")
'((0 . 3) (4 . 7))
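Having a "strict" mode would show up this error: byte 255 is not valid
UTF-8, so none of the character regexes above can match it, and it is
silently skipped.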
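
Something close to the proposed strict mode can be approximated in user code today. The following is only a sketch under stated assumptions: regexp-match-positions*/strict is a hypothetical wrapper, not part of Racket, and it assumes that "strict" means "reject a byte-string input that is not valid UTF-8 when the pattern is a character regexp". It relies on bytes-utf-8-length returning #f for bytes that do not decode as UTF-8.

```racket
#lang racket/base

;; Hypothetical helper, not part of Racket: behaves like
;; regexp-match-positions*, but refuses to apply a character
;; pattern to a byte string that is not valid UTF-8.
(define (regexp-match-positions*/strict pattern input)
  (when (and (or (regexp? pattern) (string? pattern)) ; character pattern
             (bytes? input)
             ;; bytes-utf-8-length returns #f when the bytes
             ;; are not a valid UTF-8 encoding
             (not (bytes-utf-8-length input)))
    (error 'regexp-match-positions*/strict
           "character regexp applied to bytes that are not valid UTF-8: ~e"
           input))
  (regexp-match-positions* pattern input))

;; The input from the examples above, with byte 255 written as an
;; octal escape so the source file encoding does not matter.
(define bad #"abc\377abc")

;; A byte regexp sees byte 255 as an ordinary byte that is not "k".
(regexp-match-positions*/strict #rx#"[^k]+" bad) ; '((0 . 7))

;; A character regexp on the same input is now reported instead of
;; silently skipping byte 255; the handler just returns the message.
(with-handlers ([exn:fail? (lambda (e) (exn-message e))])
  (regexp-match-positions*/strict #rx"[^k]+" bad))
```

The check scans the whole input before matching, so this sketch only suits in-memory byte strings; ports would need a different approach.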

Thanks,
Harry Spier


Posted on the users mailing list.