[plt-scheme] Unicode on the cheap

From: Dor Kleiman (dor at ntr.co.il)
Date: Sun Jan 25 05:12:53 EST 2004

I believe you should provide a function call or some internal
definition, say (utf-8), that enables the new behavior, so that older
applications (such as ones that implement some sort of telnet or
something) don't stop working in newer versions.

ifconfig

-----Original Message-----
From: Matthew Flatt [mailto:mflatt at cs.utah.edu] 
Sent: Sunday, January 25, 2004 3:30 AM
To: plt-scheme at list.cs.brown.edu
Subject: [plt-scheme] Unicode on the cheap

  For list-related administrative tasks:
  http://list.cs.brown.edu/mailman/listinfo/plt-scheme

After several aborted attempts in the past year to add Unicode support
to MzScheme/MrEd (and after wading through the recent deluge of
Unicode-related messages on comp.lang.scheme and the SRFI-50 list), I'm
ready to settle for minimal changes: a UTF-8 interpretation of byte
strings, plus support in the editor at the level of code points.

Separating "char" from "byte" still seems like the right thing in
principle, but the notion of "Unicode character" is so complex that
practical definitions of "char" end up approximating. I'm starting to think
that "UTF-8 code unit" is as good an approximation as any. To put it
another way, I suspect that "char" may be a useless datatype.

Maybe that's all wrong, but I'm still pretty sure that anything other
than UTF-8 is more trouble than it's worth for us.

A proposed conversion plan follows.

Matthew

----------------------------------------

Core changes:

 * We re-define the character-string interpretation of "string" to be a
   UTF-8 encoding of Unicode, instead of a Latin-1 encoding.

   A PLT Scheme "char" is therefore a UTF-8 code unit, much like a Java
   "char" is a UTF-16 code unit (not to be confused with a "Unicode
   character", or even a "Unicode code point").

   In addition to UTF-8 code units, though, we also have the chars
   #\376 and #\377, which are not legal in a UTF-8 sequence. With these
   two additions, the char <-> byte isomorphism remains.

   This interpretation affects the way strings are used as filenames
   under Windows, the way they are used for labels in a GUI, the way
   they are drawn on the screen, and the way they are interpreted for
   locale-specific comparisons. That's about it (which is the beauty
   of UTF-8), but it's enough to warrant a major-version increment.

 * We change the editor to support Unicode code points as items, and
   have the editor read and write text files using UTF-8 (instead of
   Latin-1). More generally, strings going into and out of an editor
   (e.g., through the `insert' and `get-text' methods) will be in
   UTF-8.

   Keyboard events will sometimes report the pressed "key" as a string
   (UTF-8 encoding of a code point) if it doesn't fit into a char.

These parts fit together in the usual UTF-8 way: If you type a Chinese
character into DrScheme, then the MzScheme reader will see a sequence
of bytes in the 128-253 range; none of the "chars" in 128-253 are
special, so the stream will parse as a symbol. When MzScheme later
prints the symbol (as a result, in an error message, or whatever),
DrScheme's editor will decode the UTF-8 byte stream and draw Chinese
characters.
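
To make the char-level view concrete: the letter e with an acute
accent (U+00E9) is the UTF-8 byte sequence 195 169, so under this
proposal a string holding just that one character contains two chars,
and the existing string primitives all see the code units. A sketch,
using only standard primitives:

   (define s (string #\303 #\251))        ; U+00E9 as UTF-8 code units
   (string-length s)                      ; => 2 (code units, not characters)
   (map char->integer (string->list s))   ; => (195 169)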

----------------------------------------

The most visible implication of the UTF-8 approach is that, in
DrScheme's REPL,

   (string-length "$%^#")

will produce a value between 4 and 20, depending on the letters in place
of "$", "%", "^", and "#". For Chinese letters, the result will tend to
be 12. For non-English Latin-based letters, the result will tend to be
4 or 5.

MzScheme will provide functions such as `string->code-point-vector',
which converts a string to a vector of numbers, and
`string-code-point-length', which returns the number of code points in
the UTF-8 decoding of a string. So

   (string-code-point-length "$%^#")

will always produce 4 in DrScheme. (Perhaps there are better names for
these functions.)
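
To make the contrast concrete (using the proposed name, which does not
exist yet): with U+00E9 written as its two UTF-8 code units,

   (string-length (string #\h #\303 #\251))              ; => 3 code units
   (string-code-point-length (string #\h #\303 #\251))   ; => 2 code points

so the code-point count is usually the "length" that programs actually
want.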


Regexp matching will continue to work on chars/bytes. Consequently,

   (regexp-match #rx"$%^#" s)

will work as expected for any "$", etc., but not

   (regexp-match #rx"[$%^#]" s)

since "$" might correspond to multiple chars (i.e., code units) in the
string. Of course, this problem can be attributed to abusing strings in
the first place for writing regular expressions.
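
As a sketch of the pitfall, writing the two UTF-8 code units of U+00E9
as octal escapes:

   (regexp-match #rx"caf\303\251" "caf\303\251")   ; matches the two-unit sequence
   (regexp-match #rx"[\303\251]" "caf\303\251")    ; matches the single char #\303

The character class matches either code unit on its own, which is
almost never what was intended.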


A problem that's perhaps more significant than either of the above:
some files and streams may use the current locale encoding of
characters, rather than UTF-8 (e.g., GB for Chinese). I propose that we
continue to ignore this problem, and generally rely on
people/environments to switch to UTF-8. MzScheme can provide some
conversion functions for manual conversion (discussed further below).

----------------------------------------

Position and column counting for a port will be sensitive to UTF-8. For
example, reading #\302 followed by #\251 will increment the position
and column by 1, instead of 2.
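
A rough sketch of the intended accounting, assuming line counting is
turned on for the port (the exact accessor for reading back the
position is left out, since it isn't the point here):

   (define p (open-input-string (string #\302 #\251 #\x)))  ; copyright sign, then "x"
   (port-count-lines! p)
   (read-char p)   ; #\302
   (read-char p)   ; #\251 -- completes one code point
   ;; the port's position and column have now advanced by 1, not 2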

----------------------------------------

For language-specific ordering and case folding, we already have a
locale system in place. As far as I can tell, the underlying
functionality is about the best we can do, no matter how close "char"
is to a "Unicode character". Still, there are some issues.

The locale currently controls two things:

 * The interpretation of a byte stream as characters.

 * The ordering and case relationship among characters.

For example, `string-locale-ci=?' currently interprets a byte string in
terms of the current locale's encoding, and then compares. Our new
system should split the control. The `locale' functions should
interpret the byte stream as UTF-8, independent of the locale, and rely
on the locale only for ordering and case relationship.

(The actual comparison will sometimes require converting to the
locale's encoding, but that's internal. Meanwhile, under Windows and
Mac OS, MzScheme should probably use the Windows/Mac native locale
support rather than the Unix-style wrapper functions, but that's a
small adjustment.)
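
For example, with the strings below holding the UTF-8 encodings of
E-acute and e-acute (written as octal escapes), the bytes would be
decoded as UTF-8 regardless of the locale's own encoding, and the
locale would supply only the case-folding rule:

   (string-locale-ci=? "caf\303\211" "caf\303\251")   ; => #t in most locales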

We should add new functions, roughly `locale-string->utf8-string' and
`utf8-string->locale-string', for converting between representations.
These can be used to overcome the default interpretation of byte
streams as UTF-8.
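
A sketch of the intended use, keeping in mind that these are only
rough names: in a Latin-1 locale the single byte 233 (#\351) is
e-acute, so conversion to and from the default UTF-8 interpretation
would look like

   (locale-string->utf8-string (string #\351))         ; => (string #\303 #\251)
   (utf8-string->locale-string (string #\303 #\251))   ; => (string #\351)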

Functions like `char-locale-ci=?', `char-locale-ci<?', and
`char-locale-upper-case?' are nonsense, because it makes no sense to
operate on UTF-8 code units. We should drop them and add a few
functions like `string-locale-upper-case?'.
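
Presumably the string version would ask the question of a whole UTF-8
string under the current locale's rules, along the lines of

   (string-locale-upper-case? "CAF\303\211")   ; => #t (the code units decode to E-acute)
   (string-locale-upper-case? "caf\303\251")   ; => #f

but the exact name and semantics are still open.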

----------------------------------------

When rendering a sequence of Unicode code points, some drawing
toolboxes can handle "combining characters" (shouldn't that be
"combining code points"?) to form a single glyph for a sequence of code
points, and some cannot. We'll generally let the toolbox do whatever it
does, and it will work reliably for single-code-point characters (which
are the kind that the editor supports).

For consistency, the editor needs to render each code point as a
separate glyph, no matter the capabilities of the underlying toolbox,
and other programs may need similar functionality. A flag to
`draw-text' and `get-text-extent' will enable per-code-point rendering
in general. Conveniently, the existing "big-chars?" flag (which has
never quite worked) becomes obsolete with a UTF-8 interpretation of
strings, so it can be recycled.
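
As a purely hypothetical sketch (the flag's name and position are not
settled here), drawing an "e" followed by a combining acute accent
(U+0301, code units #\314 #\201) with the per-code-point flag set
would produce two separate glyphs, even on a toolbox that could
combine them:

   (define dc (make-object bitmap-dc% (make-object bitmap% 40 20)))
   (send dc draw-text "e\314\201" 2 2 #t)   ; hypothetical flag: one glyph per code point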



Posted on the users mailing list.