[plt-scheme] Unicode on the cheap

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sat Jan 24 23:29:54 EST 2004

At Sat, 24 Jan 2004 21:47:43 -0600, Robby Findler wrote:
> On Jan 24, 2004, at 7:30 PM, Matthew Flatt wrote:
> >    A PLT Scheme "char" is therefore a UTF-8 code unit, [...]
> 
> Those two don't quite seem to reconcile, from what I've understood of 
> utf-8 code points.

A "code point" is a Unicode number. A "code unit" in a particular
encoding scheme is one of the numbers used to encode a code point; in
UTF-8, each code unit is a single byte.
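
A minimal sketch of the distinction, in Python rather than Scheme purely
for illustration (Python's str is assumed here as a stand-in for a
sequence of code points, and bytes for a sequence of UTF-8 code units;
the variable names are hypothetical):

```python
# Illustration only: one code point can need several UTF-8 code units.
text = "\u03bb"  # GREEK SMALL LETTER LAMDA, code point U+03BB

# A str iterates over code points; ord() gives each one's number.
code_points = [ord(ch) for ch in text]

# Encoding to UTF-8 yields the code units (bytes) for those code points.
code_units = list(text.encode("utf-8"))

print([hex(cp) for cp in code_points])  # ['0x3bb']
print([hex(cu) for cu in code_units])   # ['0xce', '0xbb']
```

So a "char" that is a UTF-8 code unit would hold 0xCE or 0xBB here, not
the code point 0x3BB itself.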

> Also, is the reason that regexps don't work out quite due to the fact
> that the library we use for regexps doesn't work with utf-8 strings?

Our regexp library works in terms of bytes, byte strings, and byte
ports. That is not inherently incompatible with UTF-8 encodings, but
there's an issue in defining regexp patterns: the UTF-8 encoding of a
pattern, matched byte by byte, does not produce the same result as the
original code-point sequence used as a pattern (in a matcher whose
pattern syntax is defined in terms of code points).
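
One way to see the mismatch, sketched with Python's re module as a
stand-in for both kinds of matcher (an assumption for illustration; it
is not the PLT Scheme regexp library, and the variable names are
hypothetical). A postfix operator such as "*" binds to a whole character
in a code-point pattern, but only to the last byte once the pattern has
been UTF-8 encoded:

```python
import re

text = "\u00e9\u00e9"        # two code points: U+00E9 twice
data = text.encode("utf-8")  # four code units: C3 A9 C3 A9

# Pattern over code points: "*" repeats the whole character U+00E9.
by_code_point = re.fullmatch("\u00e9*", text)

# Same pattern, UTF-8 encoded and given to a byte-level matcher:
# "*" now repeats only the final byte, so the pattern means C3 A9*.
by_byte = re.fullmatch("\u00e9*".encode("utf-8"), data)

matched_code_points = by_code_point is not None
matched_bytes = by_byte is not None

print(matched_code_points)  # True
print(matched_bytes)        # False
```

The same kind of divergence can show up with other pattern syntax that
treats a character as a unit, such as "?" or a bracketed range.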

> Is it possible there is a new version of the library that might?

No, because we've greatly modified the original library.

Matthew



Posted on the users mailing list.