[plt-scheme] Unicode, take 2
I take back half of what I wrote the first time around (in "Unicode on
the cheap"). It looks like a better compromise is to make "char" mean
"unicode code point", but keep "port" meaning "byte port". Details below.
Matthew
----------------------------------------
New proposal:
* "char" means "unicode code point", though extended to include up to
(sub1 (expt 2 31)). In particular, `integer->char' produces a
character for every exact integer from 0 to #x7FFFFFFF, except
#xFFFF, #xFFFE, and #xD800 to #xDFFF (which are not characters and
never will be).
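To make the accepted range concrete, here is a minimal sketch of the domain check described above. It is written in Python purely for illustration; the names `MAX_CHAR` and `is_char_value` are hypothetical and are not part of the proposal.

```python
# Hypothetical illustration of the proposed integer->char domain.
MAX_CHAR = 2**31 - 1  # (sub1 (expt 2 31)), i.e. #x7FFFFFFF


def is_char_value(n: int) -> bool:
    """Return True if n falls in the range the proposal accepts."""
    if not 0 <= n <= MAX_CHAR:
        return False
    if n in (0xFFFE, 0xFFFF):      # excluded individually
        return False
    if 0xD800 <= n <= 0xDFFF:      # excluded surrogate range
        return False
    return True
```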
The `byte-string->string/utf8' and `string->byte-string/utf8'
functions convert between the two kinds of strings via UTF-8.
There's also `byte-string->string/utf8-permissive', which produces a
given character in place of bad encoding sequences.
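The difference between the strict and permissive conversions can be sketched as follows. This is Python, not MzScheme, and the function names are hypothetical. How many replacement characters a run of bad bytes produces is not specified above; this sketch simply follows the error spans reported by Python's decoder.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8; raises UnicodeDecodeError on a bad sequence."""
    return data.decode("utf-8")


def decode_permissive(data: bytes, replacement: str = "?") -> str:
    """Decode UTF-8, putting `replacement` in place of each bad sequence."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into data[pos:].
            out.append(data[pos:pos + err.start].decode("utf-8"))
            out.append(replacement)
            pos += err.end
    return "".join(out)
```

For example, `decode_permissive(b"hi\xffthere")` gives `"hi?there"`, while `decode_strict` raises on the same input.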
A general `byte-string-convert' interface lets you convert among
different encodings in a byte-string, including UTF-8 and the
current locale's encoding. The conversion interface can deal with
input that ends mid-encoding, so it can be used for conversion on
streams, too. (The converter uses iconv() where available.)
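As an illustration of why handling input that ends mid-encoding matters for streams, here is a small sketch using Python's standard incremental decoder. It only shows the idea; it is not the proposed `byte-string-convert' interface, and `decode_chunks` is a hypothetical name.

```python
import codecs


def decode_chunks(chunks):
    """Yield decoded text for a sequence of UTF-8 byte chunks.

    A multi-byte sequence split across two chunks is buffered by the
    decoder and completed when the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Here `list(decode_chunks([b"caf\xc3", b"\xa9!"]))` produces `["caf", "é!"]`: the first chunk ends in the middle of a two-byte sequence, and nothing is emitted for it until the second byte shows up.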
Internally, strings are encoded as UCS-4, but symbols are encoded in
UTF-8.
* Add a `bytes-...' operation for most every `string-...' operation.
The `byte?' predicate returns true for exact integers in [0,255].
* "port" still means "byte port". Rename port operations like
`read-string-avail!' to `read-bytes-avail!'.
Character operations on a port, such as `read-char' and
`read-string', are defined in terms of a UTF-8 parsing/writing of
the port's byte stream. (With a custom-port wrapper and the
byte-string conversion functions, other decodings can be
implemented.)
Position and column counting for a port will be sensitive to UTF-8.
For example, reading #o302 followed by #o251 will increment the
position and column by 1, instead of 2.
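The arithmetic in that example can be checked with a short sketch (Python, for illustration only; `advance_position` is a hypothetical helper, not a port operation). The bytes #o302 #o251 are the UTF-8 encoding of a single character, U+00A9.

```python
def advance_position(position: int, data: bytes) -> int:
    """Advance a character position by the characters encoded in data."""
    return position + len(data.decode("utf-8"))


copyright_bytes = bytes([0o302, 0o251])  # two bytes, one character
```

With these definitions, `advance_position(0, copyright_bytes)` is 1 even though `len(copyright_bytes)` is 2.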
* Perhaps surprisingly, paths are represented by byte strings, not
strings. All functions that consume a pathname accept a string and
implicitly convert it (via UTF-8) to a byte-string pathname.
A Unix pathname is a byte string. Typically, you want to interpret a
path according to the current locale's encoding when you print it,
but there's no guarantee that the path is well-formed using the
current locale's encoding.
Under Windows, where a pathname is a UTF-16 string, MzScheme
internally converts to and from byte strings via UTF-8. A byte
string that is not a UTF-8 encoding will never correspond to a
pathname under Windows.
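A rough sketch of the Windows round trip, assuming little-endian UTF-16 for the native form. This is Python for illustration; the helper names are hypothetical and this is not MzScheme's internal code.

```python
def bytes_path_to_utf16(path: bytes) -> bytes:
    """Byte-string path (UTF-8) to UTF-16; raises if not valid UTF-8."""
    return path.decode("utf-8").encode("utf-16-le")


def utf16_to_bytes_path(wide: bytes) -> bytes:
    """UTF-16 path back to the UTF-8 byte-string form."""
    return wide.decode("utf-16-le").encode("utf-8")
```

A well-formed path survives the round trip unchanged, while a byte string such as `b"C:\\\xff"` fails in `bytes_path_to_utf16`, which is the sense in which it cannot name a file.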
* `regexp' will work on strings and `byte-regexp' will work on
byte strings.
A regexp can be matched against a byte-string (or port), in which
case the byte-string (or port) is interpreted as a UTF-8 encoding.
Similarly, a byte-regexp can be matched against a string, in which
case the string is encoded via UTF-8 before matching.
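The practical effect of the two matching modes can be sketched in Python, where a text pattern cannot be applied to bytes directly, so the UTF-8 conversion is written out explicitly. The helper names are hypothetical.

```python
import re

CHAR_DOT = re.compile(r".")   # character pattern: "." matches one character
BYTE_DOT = re.compile(rb".")  # byte pattern: "." matches one byte


def match_chars_in_bytes(pattern, data: bytes):
    """Match a character pattern against bytes read as UTF-8."""
    return pattern.match(data.decode("utf-8"))


def match_bytes_in_string(pattern, text: str):
    """Match a byte pattern against the UTF-8 encoding of a string."""
    return pattern.match(text.encode("utf-8"))
```

Against the two-byte encoding of "é", the character pattern consumes the whole character, while the byte pattern consumes only the first byte, #xC3.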
* A hash before a string makes it a byte-string literal:
(string->list "hi") = '(#\h #\i)
(bytes->list #"hi") = '(104 105)
Similarly, #rx"...." is still a regexp, while #rx#"...." is a
byte-regexp.
* All of the `char-whitespace?', `char-alphabetic?', etc. functions
are defined in accordance with SRFI-14. New functions include
`char-title-case?', `char-blank?', `char-graphic?', `char-symbolic?',
and `char-titlecase'.
* The built-in string functions remain locale-independent (as in
SRFI-13), and `string-locale=?', etc. provide locale-sensitive
comparisons. The `string-locale-upcase' and `string-locale-downcase'
functions provide locale-sensitive case conversion. No
locale-sensitive character operations are provided.
* Case-insensitivity for symbols is consistent with SRFI-13, which
means using the 1-1 character mapping defined by the Unicode
Consortium.
Number parsing recognizes only ASCII digits (and A-F/a-f), but all
`char-whitespace?' characters are treated as whitespace by `read'.
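The asymmetry between digits and whitespace can be sketched like this. It is a Python illustration with a hypothetical `read_integers` helper; Python's own notion of Unicode whitespace stands in for `char-whitespace?' and may not match it exactly, and hexadecimal digits are left out for brevity.

```python
import re

ASCII_INT = re.compile(r"[0-9]+")  # explicit class: ASCII digits only


def read_integers(text: str):
    """Split on Unicode whitespace; parse only all-ASCII-digit tokens."""
    result = []
    for token in text.split():  # no-argument split uses Unicode whitespace
        if ASCII_INT.fullmatch(token):
            result.append(int(token))
        else:
            result.append(token)  # left as a non-number token
    return result
```

So a no-break space or an em space separates tokens just as an ASCII space does, but a token made of non-ASCII digits is not read as a number.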
* MzScheme effectively assumes UTF-8 stdin and stdout. DrScheme reads
and writes text files using UTF-8.