[plt-scheme] Unicode, take 2

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Fri Feb 13 09:13:41 EST 2004

I take back half of what I wrote the first time around (in "Unicode on
the cheap"). It looks like a better compromise is to change "char" to
mean "Unicode code point", but keep "port" = "byte port". Details
below.

Matthew

----------------------------------------

New proposal:

 * "char" means "Unicode code point", though extended to include up to
   (sub1 (expt 2 31)). In particular, `integer->char' produces a
   character for every exact integer from 0 to #x7FFFFFFF, except
   #xFFFF, #xFFFE, and #xD800 to #xDFFF (which are not code points and
   never will be).

   The `byte-string->string/utf8' and `string->byte-string/utf8'
   functions convert between byte strings and character strings via
   UTF-8. There's also `byte-string->string/utf8-permissive', which
   produces a given character in place of bad encoding sequences.

   A general `byte-string-convert' interface lets you convert among
   different encodings in a byte-string, including UTF-8 and the
   current locale's encoding. The conversion interface can deal with
   input that ends mid-encoding, so it can be used for conversion on
   streams, too. (The converter uses iconv() where available.)

   Internally, strings are encoded as UCS-4, but symbols are encoded in
   UTF-8.
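
A rough illustration of the rules in this item, written in Python only
because it runs as-is. None of these names are MzScheme API: the
helpers `is_valid_char', `decode_permissive', and `decode_chunks' are
hypothetical, and Python fixes the substitute character to U+FFFD
where the proposal lets the caller supply one.

```python
import codecs

MAX_CHAR = 2**31 - 1  # (sub1 (expt 2 31)) in the proposal


def is_valid_char(n):
    """Return True if integer n is in the proposed range for integer->char."""
    if not 0 <= n <= MAX_CHAR:
        return False
    if n in (0xFFFE, 0xFFFF):
        return False
    # #xD800 to #xDFFF are excluded as well
    return not 0xD800 <= n <= 0xDFFF


def decode_permissive(data):
    """Decode UTF-8 bytes, substituting U+FFFD for bad encoding sequences."""
    return data.decode("utf-8", errors="replace")


def decode_chunks(chunks):
    """Decode a stream of byte chunks that may split an encoding mid-sequence.

    Returns one decoded string per chunk, plus a final flush result.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if the stream ended mid-encoding
    out.append(decoder.decode(b"", final=True))
    return out
```

The incremental decoder holds back an incomplete sequence until the
remaining bytes arrive, which is the property that makes a converter
usable on streams.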

 * Add a `bytes-...' operation for most every `string-...' operation.
   The `byte?' predicate returns true for exact integers in [0,255].

 * "port" still means "byte port". Rename port operations like
   `read-string-avail!' to `read-bytes-avail!'.

   Character operations on a port, such as `read-char' and
   `read-string', are defined in terms of a UTF-8 parsing/writing of
   the port's byte stream. (With a custom-port wrapper and the
   byte-string conversion functions, other decodings can be
   implemented.)

   Position and column counting for a port will be sensitive to UTF-8.
   For example, reading #o302 followed by #o251 will increment the
   position and column by 1, instead of 2.
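
The counting rule can be sketched as follows, again in Python rather
than MzScheme. The function name `count_positions' is hypothetical,
and the sketch assumes well-formed UTF-8 input (it raises
UnicodeDecodeError otherwise).

```python
import codecs


def count_positions(data):
    """Return (bytes_read, chars_read) after consuming all of data."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    bytes_read = 0
    chars_read = 0
    for i in range(len(data)):
        bytes_read += 1
        # The decoder yields a character only once its last byte arrives.
        chars_read += len(decoder.decode(data[i:i + 1]))
    return bytes_read, chars_read
```

For the two bytes #o302 #o251 (the UTF-8 encoding of U+00A9), the byte
count is 2 while the character count, which is what position and
column would track, is 1.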

 * Perhaps surprisingly, paths are represented by byte strings, not
   strings. All functions that consume a pathname accept a string and
   implicitly convert it (via UTF-8) to a byte-string pathname.

   A Unix pathname is a byte string. Typically, you want to interpret a
   path according to the current locale's encoding when you print it,
   but there's no guarantee that the path is well-formed using the
   current locale's encoding.

   Under Windows, where a pathname is a UTF-16 string, MzScheme
   internally converts to and from byte strings via UTF-8. A byte
   string that is not a UTF-8 encoding will never correspond to a
   pathname under Windows.
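
A minimal sketch of the Windows mapping described above, in Python.
The names `utf16_pathname' and `byte_pathname' are hypothetical, and
the sketch assumes little-endian UTF-16 with no unpaired surrogates.

```python
def utf16_pathname(path_bytes):
    """Map a byte-string path to UTF-16 bytes, or None if it is not UTF-8."""
    try:
        text = path_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # A byte string that is not a UTF-8 encoding names no pathname.
        return None
    return text.encode("utf-16-le")


def byte_pathname(utf16_bytes):
    """Map UTF-16 pathname bytes back to the byte-string (UTF-8) form."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")
```

Every UTF-16 pathname of this kind round-trips through its byte-string
form, but the reverse direction is partial: some byte strings have no
UTF-16 counterpart.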

 * `regexp' will work on strings and `byte-regexp' will work on
   byte strings.

   A regexp can be matched against a byte-string (or port), in which
   case the byte-string (or port) is interpreted as a UTF-8 encoding.

   Similarly, a byte-regexp can be matched against a string, in which
   case the string is encoded via UTF-8 before matching.
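
The difference between the two kinds of regexp can be sketched in
Python, whose `re' module also keeps character patterns and byte
patterns separate. The wrapper names `regexp_match' and
`byte_regexp_match' are hypothetical, and returning the match in the
pattern's own type is a choice made for this sketch, not something the
proposal specifies.

```python
import re


def regexp_match(pattern, subject):
    """Match a character regexp; a bytes subject is read as UTF-8."""
    if isinstance(subject, bytes):
        subject = subject.decode("utf-8")
    m = re.search(pattern, subject)
    return m.group(0) if m else None


def byte_regexp_match(pattern, subject):
    """Match a byte regexp; a str subject is encoded as UTF-8 first."""
    if isinstance(subject, str):
        subject = subject.encode("utf-8")
    m = re.search(pattern, subject)
    return m.group(0) if m else None
```

With the subject "h" followed by U+00E9, the character pattern "h."
matches both characters, while the byte pattern b"h." matches "h" and
only the first byte of the two-byte encoding of U+00E9.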

 * A hash before a string makes it a byte-string literal:

      (string->list "hi") = '(#\h #\i)
      (bytes->list #"hi") = '(104 105)

   Similarly, #rx"...." is still a regexp, while #rx#"...." is a
   byte-regexp.

 * All of the `char-whitespace?', `char-alphabetic?', etc. functions
   are defined in accordance with SRFI-14. New functions include
   `char-title-case?', `char-blank?', `char-graphic?', `char-symbolic?',
   and `char-titlecase'.

 * The built-in string functions remain locale-independent (as in
   SRFI-13), and `string-locale=?', etc. provide locale-sensitive
   comparisons. The `string-locale-upcase' and `string-locale-downcase'
   functions provide locale-sensitive case conversion. No
   locale-sensitive character operations are provided.

 * Case-insensitivity for symbols is consistent with SRFI-13, which
   means using the 1-1 character mapping defined by the Unicode
   consortium.

   Number parsing recognizes only ASCII digits (and A-F/a-f), but all
   `char-whitespace?' characters are treated as whitespace by `read'.

 * MzScheme effectively assumes UTF-8 stdin and stdout. DrScheme reads
   and writes text files using UTF-8.



Posted on the users mailing list.