[plt-scheme] Unicode on the cheap
At Sat, 24 Jan 2004 21:47:43 -0600, Robby Findler wrote:
> On Jan 24, 2004, at 7:30 PM, Matthew Flatt wrote:
> > A PLT Scheme "char" is therefore a UTF-8 code unit, [...]
>
> Those two don't quite seem to reconcile, from what I've understood of
> utf-8 code points.
A "code point" is a Unicode number. A "code unit" in a particular
encoding scheme is one of the fixed-size numbers used to encode code
points; in UTF-8, a code unit is a byte, and a single code point is
encoded as one to four of them.
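As a rough illustration of the distinction (in Python rather than PLT
Scheme, purely because its string/bytes split makes the two notions easy
to see; the variable names are made up for this sketch):

```python
# One character, GREEK SMALL LETTER LAMDA (U+03BB).
ch = "\u03bb"

# The code point is a single Unicode number.
code_point = ord(ch)

# The UTF-8 code units are the bytes used to encode that number.
code_units = list(ch.encode("utf-8"))

print(hex(code_point))                # 0x3bb
print([hex(u) for u in code_units])   # ['0xce', '0xbb']
```

Here one code point is carried by two UTF-8 code units, so a "char" that
holds a code unit holds only part of this character.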
> Also, is the reason that regexps don't work out quite due to the fact
> that the library we use for regexps doesn't work with utf-8 strings?
Our regexp library works in terms of bytes, byte strings, and byte
ports. That is not inherently incompatible with UTF-8 encodings, but
there's an issue with defining regexp patterns: the UTF-8 encoding of a
pattern, interpreted byte by byte, does not match the same things as
the original code-point sequence would (in a matcher whose pattern
syntax is defined in terms of code points). For example, an operator
such as `*' or `+' after a multi-byte character applies only to the
last byte of that character's encoding.
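A minimal sketch of the mismatch, using Python's `re` module as a
stand-in for both kinds of matcher (it is not our regexp library, and
the names below are hypothetical): the same pattern is applied once as a
code-point pattern and once as its UTF-8 bytes.

```python
import re

text = "\u03bb\u03bb"          # two lambdas
pattern = "\u03bb+"            # "one or more lambdas", in code points

# Code-point matcher: `+` applies to the whole character.
cp_match = re.fullmatch(pattern, text)

# Byte matcher given the UTF-8 encoding of the same pattern:
# the pattern is now b"\xce\xbb+", so `+` applies only to byte 0xBB.
byte_pattern = pattern.encode("utf-8")
byte_match = re.fullmatch(byte_pattern, text.encode("utf-8"))

# The byte pattern instead accepts 0xCE followed by repeated 0xBB,
# which is not even a valid UTF-8 sequence.
odd_match = re.fullmatch(byte_pattern, b"\xce\xbb\xbb")

print(cp_match is not None)    # True
print(byte_match is not None)  # False
print(odd_match is not None)   # True
```

The encoded pattern still works for plain literal sequences; the
trouble shows up with operators and character ranges, whose scope is a
single byte rather than a single code point.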
> Is it possible there is a new version of the library that might?
No, because we've greatly modified the original library, so a newer
upstream version would not drop in.
Matthew