[plt-scheme] Unicode on the cheap

From: Joe Marshall (jrm at ccs.neu.edu)
Date: Mon Jan 26 10:25:55 EST 2004

Robby Findler <robby at cs.uchicago.edu> writes:

> Can you explain to the unenlightened Schemer what UTF-8 really is
> about? (or perhaps provide a pointer?). It's hard to understand the
> level of detail in your message without some kind of introduction to
> UTF-8.....

For those that want the absolute bare minimum, here it is:

Unicode can be thought of as a character set with (expt 2 21)
characters (technically, some of the things in Unicode are not
`characters', so they are called `code points').  If you don't wish to
communicate with other applications, you can represent these any way
you wish, but if you want to communicate with other applications, you
have to agree on the representation.  The Unicode Consortium defines
several different encodings.  Each encoding has advantages and
drawbacks.  Here are the most popular:

UTF-32LE  Each code point is a 32-bit value that is held as 4
          sequential bytes with the least significant byte first
          (little endian).

UTF-32BE  Each code point is a 32-bit value that is held as 4
          sequential bytes with the most significant byte first
          (big endian).

UTF-32    Each code point is a 32-bit value.  If serialized into
          bytes, this is ambiguous, but Unicode allows for a
          `byte-order mark' (BOM) to be placed at the front of the
          character stream.

UTF-8     Each code point is encoded as 1 to 4 bytes (the original
          design allowed up to 6) as follows:
          If the most significant bit of the first byte is zero, then
          it is the only byte and the code is the value of the next
          seven bits.  Otherwise, the most significant bit is one, and
          the number of additional bytes is encoded in UNARY in the
          immediately following bits, followed by a zero bit.  Each
          additional byte must have a `10' in the most significant
          position, leaving six bits available for the code.

          Example:  The copyright character
                    U+00A9 = 0000 0000 1010 1001

          110 00010  10 101001 = #xC2 #xA9

          Example:  The not-equal-to character 
                    U+2260 = 0010 0010 0110 0000

          1110 0010  10 001001  10 100000 = #xE2 #x89 #xA0
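
  In PLT Scheme terms, the two examples above fall out of a few
  lines of bit-twiddling.  Here is a minimal sketch of an encoder
  for the one- to three-byte cases (the function name and shape are
  mine, not anything from the MzScheme libraries; bitwise-ior,
  bitwise-and, and arithmetic-shift are MzScheme primitives):

    ;; code-point->utf-8 : integer -> list of bytes
    ;; Encode a code point below #x10000 as UTF-8 bytes.
    (define (code-point->utf-8 cp)
      (cond ((< cp #x80)        ; 0xxxxxxx
             (list cp))
            ((< cp #x800)       ; 110yyyyy 10xxxxxx
             (list (bitwise-ior #xC0 (arithmetic-shift cp -6))
                   (bitwise-ior #x80 (bitwise-and cp #x3F))))
            (else               ; 1110zzzz 10yyyyyy 10xxxxxx
             (list (bitwise-ior #xE0 (arithmetic-shift cp -12))
                   (bitwise-ior #x80 (bitwise-and
                                      (arithmetic-shift cp -6) #x3F))
                   (bitwise-ior #x80 (bitwise-and cp #x3F))))))

    (code-point->utf-8 #x00A9)  ; => (194 169), i.e. #xC2 #xA9
    (code-point->utf-8 #x2260)  ; => (226 137 160), i.e. #xE2 #x89 #xA0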

  (A quick digression.  Unicode 1.x had only (expt 2 16) characters.
  The obvious encoding is as 2 bytes, and this was called UCS-2
  (after ISO 10646's `Universal Character Set').  Of course, 2 bytes
  is enough for any character...NOT.  Fortunately, a good chunk of
  the 16-bit space had not yet been assigned, so codes from #xD800
  through #xDFFF were reserved to allow an escape mechanism
  (surrogate pairs).

  Ok, this is a major kludge, but the thing is this:  The vast
  majority of glyphs in major use (other than Chinese) fit in the
  16-bit space.  So if you are willing to punt, you can support *most*
  of Unicode with 16-bit characters.  This is grudgingly allowed.)

UCS-2     A 16-bit subset of Unicode that doesn't include the
          surrogate sequences.

  `Surrogate pairs' are pairs of 16-bit code units where the first
  unit must lie in #xD800 through #xDBFF and the second in #xDC00
  through #xDFFF.  That leaves 10 free bits in each unit: just
  enough, after subtracting #x10000 from the code point, to encode
  the parts of Unicode you can't normally get to.

UTF-16LE  Code points U+0000 through U+FFFF (other than the
          surrogates themselves) are represented directly as a pair
          of bytes, least significant byte first.  Surrogate pairs
          are used for the higher codes.

UTF-16BE  Code points U+0000 through U+FFFF (other than the
          surrogates themselves) are represented directly as a pair
          of bytes, most significant byte first.  Surrogate pairs
          are used for the higher codes.

UTF-16    Code points U+0000 through U+FFFF (other than the
          surrogates themselves) are represented directly as 16-bit
          values.  Surrogate pairs are used for the higher codes.
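
  To make the surrogate arithmetic concrete: subtract #x10000 from
  the code point, then split the remaining 20 bits into two 10-bit
  halves.  A sketch (my own helper, not a library function):

    ;; code-point->surrogates : integer -> list of two integers
    ;; Split a code point above #xFFFF into a UTF-16 surrogate pair.
    (define (code-point->surrogates cp)
      (let ((v (- cp #x10000)))                    ; 20 bits remain
        (list (bitwise-ior #xD800 (arithmetic-shift v -10))
              (bitwise-ior #xDC00 (bitwise-and v #x3FF)))))

    ;; U+1D11E (the musical G clef) becomes #xD834 #xDD1E:
    (code-point->surrogates #x1D11E)  ; => (55348 56606)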

-----------

The advantages of UTF-8 are these:

  - ASCII codes are unchanged.

  - Most Latin characters (which are mostly ASCII) are one byte
    long, and the few extra European characters are only two bytes.

  - Byte ordering isn't ambiguous.

  - No byte in a multi-byte sequence is zero, so null-terminated
    strings still work.

  - Because of the funky encoding, it is self-synchronizing.  You
    cannot do a string search for `foo' and accidentally end up in the
    `middle' of a character.
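
  The trick is that continuation bytes always look like 10xxxxxx,
  so from any byte offset you can back up to the start of the
  enclosing character.  A sketch (helper names are mine; it walks a
  vector of bytes):

    ;; continuation-byte? : byte -> boolean
    (define (continuation-byte? b)
      (= (bitwise-and b #xC0) #x80))

    ;; char-start : vector index -> index
    ;; Back up from an arbitrary offset to the first byte of the
    ;; character that contains it.
    (define (char-start vec i)
      (if (continuation-byte? (vector-ref vec i))
          (char-start vec (- i 1))
          i))

    ;; Landing in the middle of the copyright sign backs up to 0:
    (char-start (vector #xC2 #xA9) 1)  ; => 0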

The disadvantages are:

  - You cannot index into a string without scanning from the start
    (see the sketch after this list).

  - Eastern languages (Indian, Chinese, etc.) require three bytes
    per character to encode rather than the two they take in UTF-16.
    Believe it or not, this is perceived by some as an *intentional*
    slight against eastern cultures.
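
  To put a number on the indexing cost: finding the nth character
  means counting character starts from the front, skipping the
  continuation bytes.  A sketch (my helper, reusing the
  continuation-byte? predicate from the sketch above):

    ;; utf-8-index : vector integer -> index
    ;; Byte offset of the nth code point.  O(n), not O(1).
    (define (utf-8-index vec n)
      (let loop ((i 0) (remaining n))
        (if (zero? remaining)
            i
            (let skip ((j (+ i 1)))
              (if (and (< j (vector-length vec))
                       (continuation-byte? (vector-ref vec j)))
                  (skip (+ j 1))
                  (loop j (- remaining 1)))))))

    ;; In the byte sequence #xC2 #xA9 #x41 (copyright sign, then A),
    ;; character 1 starts at byte offset 2:
    (utf-8-index (vector #xC2 #xA9 #x41) 1)  ; => 2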


-------------------------

A question for Matthew:  Why not UTF-16?

--
~jrm
