[plt-scheme] Unicode on the cheap
Robby Findler <robby at cs.uchicago.edu> writes:
> Can you explain to the unenlightened Schemer what UTF-8 really is
> about? (or perhaps provide a pointer?). It's hard to understand the
> level of detail in your message without some kind of introduction to
> UTF-8.....
For those who want the absolute bare minimum, here it is:
Unicode can be thought of as a character set with (expt 2 21)
characters (technically, some of the things in Unicode are not
`characters', so they are called `code points'). If you don't wish to
communicate with other applications, you can represent these any way
you wish, but if you want to communicate with other applications, you
have to agree on the representation. The Unicode Consortium defines
several different encodings. Each encoding has advantages and
drawbacks. Here are the most popular:
UTF-32LE Each code point is a 32-bit value that is held as 4
sequential bytes with the least significant byte first
(little endian).
UTF-32BE Each code point is a 32-bit value that is held as 4
sequential bytes with the most significant byte first
(big endian).
UTF-32 Each code point is a 32-bit value. If serialized into
bytes, this is ambiguous, but Unicode allows for a
`byte-order mark' (BOM) to be placed at the front of the
character stream.
UTF-8 Each code point is encoded as 1-4 bytes (the scheme itself
extends to 6 bytes, enough for a 31-bit space) as follows:
If the most significant bit of the first byte is zero, then
it is the only byte and the code is the value of the next
seven bits. Otherwise, the most significant bit is one, and
the number of additional bytes is encoded in UNARY in the
immediately following bits, followed by a zero bit. Each
additional byte must have a `10' in the most significant
position, leaving six bits available for the code.
Example: The copyright character
U+00A9 = 0000 0000 1010 1001
110 00010 10 101001 = #xC2 #xA9
Example: The not-equal-to character
U+2260 = 0010 0010 0110 0000
1110 0010 10 001001 10 100000 = #xE2 #x89 #xA0
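If you want that rule in code, here is a quick sketch in Scheme (the
name code-point->utf-8 is mine, and it assumes MzScheme's bitwise-and,
bitwise-ior, and arithmetic-shift; it covers only the 1- to 4-byte
cases, which is all that 21 bits of Unicode needs):

  ;; code-point->utf-8 : code point -> list of bytes
  (define (code-point->utf-8 cp)
    ;; the i'th six-bit chunk of cp, tagged with the 10xxxxxx marker
    (define (continuation i)
      (bitwise-ior #x80 (bitwise-and (arithmetic-shift cp (* -6 i)) #x3F)))
    (cond
      ((< cp #x80)    (list cp))                                 ; 0xxxxxxx
      ((< cp #x800)   (list (bitwise-ior #xC0 (arithmetic-shift cp -6))
                            (continuation 0)))                   ; 110xxxxx 10xxxxxx
      ((< cp #x10000) (list (bitwise-ior #xE0 (arithmetic-shift cp -12))
                            (continuation 1) (continuation 0)))  ; 1110xxxx 10xxxxxx 10xxxxxx
      (else           (list (bitwise-ior #xF0 (arithmetic-shift cp -18))
                            (continuation 2) (continuation 1) (continuation 0)))))

  (code-point->utf-8 #x00A9)  ; => (194 169)       i.e. #xC2 #xA9
  (code-point->utf-8 #x2260)  ; => (226 137 160)   i.e. #xE2 #x89 #xA0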
(A quick digression. Unicode originally had only (expt 2 16) characters.
The obvious encoding is as 2 bytes, and this was called the
`Universal Character Set' (UCS-2). Of course, 2 bytes is enough for any
character...NOT. Fortunately, a good chunk of the 16-bit space had
not yet been assigned, so codes from #xD800 through #xDFFF were
reserved to allow an escape mechanism (surrogate characters).
Ok, this is a major kludge, but the thing is this: The vast
majority of glyphs in major use (other than Chinese) fit in the
16-bit space. So if you are willing to punt, you can support *most*
of Unicode with 16-bit characters. This is grudgingly allowed.)
UCS-2 A 16-bit subset of Unicode that doesn't include the
surrogate sequences.
`Surrogate characters' come in pairs of 16-bit codes where the first
code must have D8, D9, DA, or DB as the top byte and the second code
must have DC, DD, DE, or DF as the top byte. That leaves 10 free bits
in each half, just enough (20 bits) to encode the parts of Unicode you
can't normally get to. (There is a little Scheme sketch of the
arithmetic after the UTF-16 entries below.)
UTF-16LE Code points U+0000 through U+FFFF (other than the reserved
surrogate range) are represented directly as 16-bit values held
in a pair of bytes, least significant byte first. Surrogate
pairs are used for the higher code points.
UTF-16BE Code points U+0000 through U+FFFF (other than the reserved
surrogate range) are represented directly as 16-bit values held
in a pair of bytes, most significant byte first. Surrogate
pairs are used for the higher code points.
UTF-16 Code points U+0000 through U+FFFF (other than the reserved
surrogate range) are represented directly as 16-bit values.
Surrogate pairs are used for the higher code points. As with
UTF-32, a BOM can mark the byte order when serialized.
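Here is the promised sketch of the surrogate arithmetic in Scheme (the
name code-point->utf-16 is mine, not a PLT primitive; it assumes the
same MzScheme bitwise operations as above):

  ;; code-point->utf-16 : code point -> list of 16-bit values
  (define (code-point->utf-16 cp)
    (if (< cp #x10000)
        (list cp)                                    ; plain 16-bit code
        (let ((v (- cp #x10000)))                    ; 20 bits left to encode
          (list (+ #xD800 (arithmetic-shift v -10))  ; top 10 bits -> #xD800..#xDBFF
                (+ #xDC00 (bitwise-and v #x3FF)))))) ; low 10 bits -> #xDC00..#xDFFF

  (code-point->utf-16 #x2260)   ; => (8800)          i.e. #x2260
  (code-point->utf-16 #x1D11E)  ; => (55348 56606)   i.e. #xD834 #xDD1E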
-----------
The advantages of UTF-8 are these:
- ASCII codes are unchanged.
- Most Latin-script text (which is mostly ASCII) is one byte per
character, and the few extra European characters are only two bytes.
- Byte ordering isn't ambiguous.
- The zero byte only ever encodes U+0000, so null-terminated C strings
still work unchanged.
- Because of the funky encoding, it is self-synchronizing. You
cannot do a string search for `foo' and accidentally end up in the
`middle' of a character.
The disadvantages are:
- You cannot index into a string without scanning from the start.
- Eastern languages (Indic, Chinese, etc.) require 3 bytes per
character to encode rather than the 1 or 2 bytes their older
encodings (or UTF-16) use. Believe it or not, this is perceived
by some as an *intentional* slight against Eastern cultures.
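A sketch of why indexing needs a scan (and of how the 10xxxxxx
continuation bytes make that scan trivial, which is the
self-synchronization point above); the name utf-8-length is mine:

  ;; utf-8-length : list of bytes -> number of code points
  ;; Counts only the bytes that start a character, i.e. skips the
  ;; continuation bytes of the form 10xxxxxx.
  (define (utf-8-length byte-list)
    (let loop ((bs byte-list) (n 0))
      (cond ((null? bs) n)
            ((= (bitwise-and (car bs) #xC0) #x80) (loop (cdr bs) n))
            (else (loop (cdr bs) (+ n 1))))))

  (utf-8-length (list #xC2 #xA9 #xE2 #x89 #xA0))  ; copyright + not-equal => 2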
-------------------------
A question for Matthew: Why not UTF-16?
--
~jrm