[plt-scheme] Unicode on the cheap

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Mon Jan 26 18:33:28 EST 2004

At Mon, 26 Jan 2004 10:25:55 -0500, Joe Marshall wrote:
> A question for Matthew:  Why not UTF-16?

Through various off-list conversations, it's clear that I need to
better explain the motivation for my proposal.

The key issue for me is distinguishing "char" from "byte". I agree that
these really should be distinguished. The real reason not to re-define
"char" is that I don't have the energy to implement this change, given
the current shape of our code.

I don't like "I've hacked myself into a corner" as the reason for a
design choice, so I've tried to rationalize, explaining how UTF-8 is
not really so awkward compared to other approaches. The rationalization
is honest; I was prepared for a long time to hack myself out of the
corner, until I seriously considered the UTF-8 option and saw how well
it could work.

We can improve PLT Scheme's Unicode support with relatively little work
by shifting to a UTF-8 interpretation of byte strings. If it turns out
that further improvement is necessary, then we can take on the
difficult change in the future.
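
To make the UTF-8 interpretation concrete, here's a sketch (just an
illustration, not part of MzScheme's actual API) of how many code units
a single code point occupies; under the proposal, each of those code
units is a "char":

  ;; Sketch: how many UTF-8 code units one Unicode code point occupies.
  ;; Under the proposal, each of these code units is a "char".
  (define (utf-8-code-unit-count code-point)
    (cond ((< code-point #x80) 1)      ; ASCII
          ((< code-point #x800) 2)     ; e.g. Greek, Hebrew
          ((< code-point #x10000) 3)   ; rest of the BMP
          (else 4)))                   ; supplementary planes

  (utf-8-code-unit-count #x41)    ; => 1  ("A")
  (utf-8-code-unit-count #x3BB)   ; => 2  (Greek lambda)
  (utf-8-code-unit-count #x4E2D)  ; => 3  (a CJK character)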

It may also turn out that we just need a workable connection to Unicode
(via UTF-8) in the current core, and then we can build a nice language
(where "char" = "code point") on top of that in the future. This will
require reader extensions, etc., but it's the sort of language layering
that we're always working toward.
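
As a rough sketch of one operation such a layer would need (hypothetical
code, not an existing MzScheme library), counting code points over UTF-8
code units just means skipping the continuation units:

  ;; Sketch: count code points in a list of UTF-8 code units by skipping
  ;; continuation units (those of the form 10xxxxxx, i.e. #x80-#xBF).
  (define (code-point-count code-units)
    (let loop ((units code-units) (n 0))
      (cond ((null? units) n)
            ((= (quotient (car units) 64) 2)   ; continuation unit
             (loop (cdr units) n))
            (else (loop (cdr units) (+ n 1))))))

  ;; "é" (U+00E9) encodes as the two code units #xC3 #xA9:
  (code-point-count (list #x68 #xC3 #xA9))   ; => 2 ("h" followed by "é")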


At Sun, 25 Jan 2004 12:12:53 +0200, "Dor Kleiman" wrote:
> I believe you should enable a function call or some internal
> definition (utf-8) that would enable it, to stop some older
> applications from not working in newer versions (such as applications
> that might implement some sort of telnet or something).

Usually, we don't try to support backward compatibility at this level,
since it just increases complexity overall. I doubt, actually, that
many applications will break in this case.


At Sun, 25 Jan 2004 11:19:45 -0500, Paul Schlie wrote:
> (string-length s) -> 9  ; length in logical characters (Unicode code-points)
> (string-UTF-8 s) -> 15  ; length in physical UTF-8 code-units (bytes)
>                         ; or maybe 16 if including a terminal null marker

I've received many suggestions along these lines, and I appreciate the
feedback. But I think that breaking the relationship between `string'
and `string-length' is out of bounds. If a "char" is a Unicode code
point, then `string-length' should report the number of code points in
a string. But if "char" is a UTF-8 code unit, then that's what
`string-length' should report.
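
To spell out that consistency argument with list-based stand-ins (again,
hypothetical; these are not MzScheme strings):

  ;; The two-code-point text "é!" where "é" (U+00E9) is the two UTF-8
  ;; code units #xC3 #xA9.
  (define as-code-points (list #xE9 #x21))       ; "char" = code point
  (define as-code-units  (list #xC3 #xA9 #x21))  ; "char" = UTF-8 code unit

  ;; Whichever reading of "char" a design picks, string-length should
  ;; report the count of those same chars:
  (length as-code-points)   ; => 2
  (length as-code-units)    ; => 3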


Matthew



Posted on the users mailing list.