[plt-scheme] Unicode on the cheap

From: Robby Findler (robby at cs.uchicago.edu)
Date: Sat Jan 24 20:54:28 EST 2004

Can you explain to the unenlightened Schemer what UTF-8 is really
about (or perhaps provide a pointer)? It's hard to understand the
level of detail in your message without some kind of introduction to 


On Jan 24, 2004, at 7:30 PM, Matthew Flatt wrote:

> After several aborted attempts in the past year to add Unicode support
> to MzScheme/MrEd (and after wading through the recent deluge of
> Unicode-related messages on comp.lang.scheme and the SRFI-50 list), I'm
> ready to settle for minimal changes: a UTF-8 interpretation of byte
> strings, plus support in the editor at the level of code points.
> Separating "char" from "byte" still seems like the right thing in
> principle, but the notion of "Unicode character" is so complex that
> practical definitions of "char" end up approximating. I start to think
> that "UTF-8 code unit" is as good an approximation as any. To put it
> another way, I suspect that "char" may be a useless datatype.
> Maybe that's all wrong, but I'm still pretty sure that anything other
> than UTF-8 is more trouble than it's worth for us.
> A proposed conversion plan follows.
> Matthew
> ----------------------------------------
> Core changes:
>  * We re-define the character-string interpretation of "string" to be a
>    UTF-8 encoding of Unicode, instead of a Latin-1 encoding.
>    A PLT Scheme "char" is therefore a UTF-8 code unit, much like a Java
>    "char" is a UTF-16 code unit (not to be confused with a "Unicode
>    character", or even a "Unicode code point").
>    In addition to UTF-8 code units, though, we also have the chars
>    #\376 and #\377, which are not legal in a UTF-8 sequence. With these
>    two additions, the char <-> byte isomorphism remains.
>    This interpretation affects the way strings are used as filenames
>    under Windows, the way they are used for labels in a GUI, the way
>    they are drawn on the screen, and the way they are interpreted for
>    locale-specific comparisons. That's about it (which is the beauty
>    of UTF-8), but it's enough to warrant a major-version increment.
>  * We change the editor to support Unicode code points as items, and
>    have the editor read and write text files using UTF-8 (instead of
>    Latin-1). More generally, strings going into and out of an editor
>    (e.g., through the `insert' and `get-text' methods) will be in
>    UTF-8.
>    Keyboard events will sometimes report the pressed "key" as a string
>    (UTF-8 encoding of a code point) if it doesn't fit into a char.
> These parts fit together in the usual UTF-8 way: If you type a Chinese
> character into DrScheme, then the MzScheme reader will see a sequence
> of bytes in the 128-253 range; none of the "chars" in 128-253 are
> special, so the stream will parse as a symbol. When MzScheme later
> prints the symbol (as a result, in an error message, or whatever),
> DrScheme's editor will decode the UTF-8 byte stream and draw Chinese
> characters.
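
[Illustration, not part of the original message.] The round trip described above can be sketched in Python, with Python `bytes` standing in for the proposed byte-oriented strings. This is only a sketch under that assumption; the variable names are made up for the example.

```python
# Python bytes stand in for the proposed byte-oriented MzScheme strings.
text = "中"                        # one code point, U+4E2D
encoded = text.encode("utf-8")     # the byte stream a reader would see

# Every byte of a multi-byte UTF-8 sequence is >= 128, so none of them
# collides with ASCII delimiters such as parentheses or whitespace.
all_high = all(b >= 128 for b in encoded)

# The bytes 0xFE and 0xFF (#\376 and #\377) never occur in UTF-8 output.
has_fe_ff = any(b in (0xFE, 0xFF) for b in encoded)

# Decoding the same bytes recovers the character an editor would draw.
decoded = encoded.decode("utf-8")

print(list(encoded), all_high, has_fe_ff, decoded)
```

Because the non-ASCII bytes carry no special meaning for a byte-oriented reader, they can pass through it untouched and be decoded only at display time.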
> ----------------------------------------
> The most visible implication of the UTF-8 approach is that, in
> DrScheme's REPL,
>    (string-length "$%^#")
> will produce a value between 4 and 20, depending on the letters in place
> of "$", "%", "^", and "#". For Chinese letters, the result will tend to
> be 12. For non-English Latin-based letters, the result will tend to be
> 4 or 5.
> MzScheme will provide functions such as `string->code-point-vector',
> which converts a string to a vector of numbers, and
> `string-code-point-length', which returns the number of code points in
> the UTF-8 decoding of a string. So
>    (string-code-point-length "$%^#")
> will always produce 4 in DrScheme. (Perhaps there are better names for
> these functions.)
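
[Illustration, not part of the original message.] The difference between the two lengths can be sketched in Python. The function names below are hypothetical stand-ins that loosely mirror the proposed `string-length', `string-code-point-length', and `string->code-point-vector'; they are not an existing MzScheme API.

```python
def code_unit_length(s: str) -> int:
    """Number of UTF-8 code units (bytes), like the proposed string-length."""
    return len(s.encode("utf-8"))

def code_point_length(s: str) -> int:
    """Number of code points, like the proposed string-code-point-length."""
    return len(s)

def code_point_vector(s: str) -> list:
    """Code points as integers, like the proposed string->code-point-vector."""
    return [ord(c) for c in s]

for sample in ["abcd", "café", "中文字符"]:
    print(sample, code_unit_length(sample), code_point_length(sample))
```

For these samples the code-point length is 4 in every case, while the code-unit length is 4 for plain ASCII, 5 for "café", and 12 for the four Chinese characters.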
> Regexp matching will continue to work on chars/bytes. Consequently,
>    (regexp-match #rx"$%^#" s)
> will work as expected for any "$", etc., but not
>    (regexp-match #rx"[$%^#]" s)
> since "$" might correspond to multiple chars (i.e., code units) in the
> string. Of course, this problem can be attributed to abusing strings in
> the first place for writing regular expressions.
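
[Illustration, not part of the original message.] A small Python sketch of the regexp issue, using byte patterns to imitate matching on code units. The sample strings are arbitrary choices for the example.

```python
import re

# A literal multi-byte sequence matches exactly where the character occurs.
sequence = re.compile("é".encode("utf-8"))
seq_hit = sequence.search("café".encode("utf-8"))    # finds the two bytes of "é"
seq_miss = sequence.search("ã".encode("utf-8"))      # no "é" here, so no match

# A character class is built from individual bytes, so it matches any one
# byte of "é" or "ü" rather than either whole character.
byte_class = re.compile(b"[" + "éü".encode("utf-8") + b"]")

# "ã" is neither "é" nor "ü", but it shares the lead byte 0xC3 with both,
# so the byte-level class reports a (spurious) match.
cls_hit = byte_class.search("ã".encode("utf-8"))

print(seq_hit is not None, seq_miss is None, cls_hit.group())
```

The sequence pattern behaves as expected, while the class matches a lone lead byte belonging to a different character.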
> A problem that's perhaps more significant than either of the above:
> some files and streams may use the current locale encoding of
> characters, rather than UTF-8 (e.g., GB for Chinese). I propose that we
> continue to ignore this problem, and generally rely on
> people/environments to switch to UTF-8. MzScheme can provide some
> conversion functions for manual conversion (discussed further below).
> ----------------------------------------
> Position and column counting for a port will be sensitive to UTF-8. For
> example, reading #\302 followed by #\251 will increment the position
> and column by 1, instead of 2.
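
[Illustration, not part of the original message.] One way to count positions in a UTF-8 stream is to skip continuation bytes, which all have the bit pattern 10xxxxxx. The helper name below is hypothetical, and this is a sketch of the counting rule rather than the actual port implementation; it assumes well-formed input.

```python
def utf8_position_count(data: bytes) -> int:
    """Count code points by skipping UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

# #\302 followed by #\251 (octal) is the UTF-8 encoding of U+00A9.
copyright_bytes = bytes([0o302, 0o251])

print(len(copyright_bytes), utf8_position_count(copyright_bytes))
```

The two bytes advance the position by one, since the second byte is a continuation byte.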
> ----------------------------------------
> For language-specific ordering and case folding, we already have a
> locale system in place. As far as I can tell, the underlying
> functionality is about the best we can do, no matter how close "char"
> is to a "Unicode character". Still, there are some issues.
> The locale currently controls two things:
>  * The interpretation of a byte stream as characters.
>  * The ordering and case relationship among characters.
> For example, `string-locale-ci=?' currently interprets a byte string in
> terms of the current locale's encoding, and then compares. Our new
> system should split the control. The `locale' functions should
> interpret the byte stream as UTF-8, independent of the locale, and rely
> on the locale only for ordering and case relationship.
> (The actual comparison will sometimes require converting to the
> locale's encoding, but that's internal. Meanwhile, under Windows and
> Mac OS, MzScheme should probably use the Windows/Mac native locale
> support rather than the Unix-style wrapper functions, but that's a
> small adjustment.)
> We should add new functions, roughly `locale-string->utf8-string' and
> `utf8-string->locale-string', for converting between representations.
> These can be used to overcome the default interpretation of byte
> streams as UTF-8.
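
[Illustration, not part of the original message.] The two conversion functions might behave roughly as follows. This Python sketch takes the encoding as an explicit argument instead of reading it from the current locale, and the function names are hypothetical analogues of the proposed ones.

```python
def locale_bytes_to_utf8(data: bytes, encoding: str) -> bytes:
    """Re-encode bytes from a locale encoding into UTF-8."""
    return data.decode(encoding).encode("utf-8")

def utf8_to_locale_bytes(data: bytes, encoding: str) -> bytes:
    """Re-encode UTF-8 bytes into a locale encoding."""
    return data.decode("utf-8").encode(encoding)

# The same character in a GB encoding and in UTF-8 uses different bytes.
gb_bytes = "中".encode("gb2312")
utf8_bytes = locale_bytes_to_utf8(gb_bytes, "gb2312")

print(list(gb_bytes), list(utf8_bytes))
```

A program reading a GB-encoded file would apply the first conversion on input and the second on output, leaving everything in between as UTF-8.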
> Functions like `char-locale-ci=?', `char-locale-ci<?', and
> `char-locale-upper-case?' are nonsense, because it makes no sense to
> operate on UTF-8 code units. We should drop them and add a few
> functions like `string-locale-upper-case?'.
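
[Illustration, not part of the original message.] A Python sketch of why case operations need whole strings rather than code units. Note the assumption: Python's `str.upper` uses Unicode's default case mapping, not a locale, so it only approximates what a `string-locale-...' function would do.

```python
word = "é"                                  # U+00E9, two UTF-8 code units

# Case-mapping each code unit separately: bytes.upper() only touches ASCII
# letters, so the two non-ASCII bytes are left unchanged.
unit_wise = word.encode("utf-8").upper()

# Case-mapping the decoded string and re-encoding gives the expected "É".
string_wise = word.upper().encode("utf-8")

print(unit_wise == word.encode("utf-8"), string_wise.decode("utf-8"))
```

No single code unit of "é" has an uppercase form, so only the string-level operation can produce a meaningful result.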
> ----------------------------------------
> When rendering a sequence of Unicode code points, some drawing
> toolboxes can handle "combining characters" (shouldn't that be
> "combining code points"?) to form a single glyph for a sequence of code
> points, and some cannot. We'll generally let the toolbox do whatever it
> does, and it will work reliably for single-code-point characters (which
> are the kind that the editor supports).
> For consistency, the editor needs to render each code point as a
> separate glyph, no matter the capabilities of the underlying toolbox,
> and other programs may need similar functionality. A flag to
> `draw-text' and `get-text-extent' will enable per-code-point rendering
> in general. Conveniently, the existing "big-chars?" flag (which has
> never quite worked) becomes obsolete with a UTF-8 interpretation of
> strings, so it can be recycled.
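
[Illustration, not part of the original message.] The distinction between a single-code-point character and a combining sequence can be seen with Python's standard `unicodedata` module. This only illustrates the code-point counts; how a toolbox actually draws the sequence is outside the sketch.

```python
import unicodedata

# "e" followed by U+0301 (combining acute accent): two code points that a
# capable toolbox may draw as one glyph.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A non-zero combining class marks U+0301 as a combining code point.
accent_class = unicodedata.combining("\u0301")

print(len(decomposed), len(composed), accent_class)
```

An editor that treats each code point as an item would hold two items for the decomposed form and one for the composed form, even though both may look the same on screen.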
