[racket] Regexps and UTF-8
Would it be possible (or would it be a good idea) for character
regexps to have a mode option, "strict" or "not-strict", that in strict
mode would throw an error if the input contained bytes that are not
valid UTF-8?
One possible use is this.
It's really easy to accidentally apply a character regexp to a byte string
(when you meant to apply a byte regexp to a byte string), run the
test cases, and think it's working OK.
I.e. to write:
(regexp-match-positions* #rx"[^ÿ]+" #"...input byte string...")
when you meant:
(regexp-match-positions* #rx#"[^ÿ]+" #"...input byte string...")
For example, this appears to work:
> (integer->char 255)
#\ÿ
> (regexp-match-positions* #rx"[^ÿ]+" #"abcÿabc")
'((0 . 3) (4 . 7))
BUT patterns that have nothing to do with ÿ give the same result,
presumably because byte 255 on its own is not valid UTF-8 and so a
character regexp never matches it:
> (regexp-match-positions* #rx"[^k]+" #"abcÿabc")
'((0 . 3) (4 . 7))
> (regexp-match-positions* #rx".+" #"abcÿabc")
'((0 . 3) (4 . 7))
Having a "strict" mode would show up this error.
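In the meantime, something like this could probably be approximated in user code. Below is a minimal sketch, assuming a wrapper I'm calling strict-regexp-match-positions* (a made-up name, not a Racket built-in). It relies on bytes-utf-8-length returning #f for a byte string that is not valid UTF-8, and it only covers the one case above: a character regexp applied to a byte string.

```racket
#lang racket/base

;; strict-regexp-match-positions* is a hypothetical name, not part of Racket.
;; It rejects a character regexp applied to a byte string that is not
;; valid UTF-8, and otherwise defers to regexp-match-positions*.
(define (strict-regexp-match-positions* pattern input)
  (when (and (regexp? pattern)      ; character regexp, not a byte regexp
             (bytes? input)
             ;; bytes-utf-8-length returns #f when input is not valid UTF-8
             (not (bytes-utf-8-length input)))
    (raise-arguments-error
     'strict-regexp-match-positions*
     "character regexp applied to a byte string that is not valid UTF-8"
     "pattern" pattern
     "input" input))
  (regexp-match-positions* pattern input))
```

With this, (strict-regexp-match-positions* #rx"[^k]+" #"abc\377abc") raises an error instead of silently splitting at byte 255, while the byte-regexp version #rx#"[^k]+" goes through unchanged. String patterns and input ports are not checked in this sketch.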
Thanks,
Harry Spier