[plt-scheme] Unicode on the cheap

From: Robby Findler (robby at cs.uchicago.edu)
Date: Sat Jan 24 22:47:43 EST 2004

On Jan 24, 2004, at 7:30 PM, Matthew Flatt wrote:
>    A PLT Scheme "char" is therefore a UTF-8 code unit, much like a Java
>    "char" is a UTF-16 code unit (not to be confused with a "Unicode
>    character", or even a "Unicode code point").
> [...]
> The most visible implication of the UTF-8 approach is that, in
> DrScheme's REPL,
>
>    (string-length "$%^#")
>
> will produce a value between 4 and 20, depending on the letters in place
> of "$", "%", "^", and "#". For Chinese letters, the result will tend to
> be 12. For non-English Latin-based letters, the result will tend to be
> 4 or 5.

Those two statements don't quite seem to reconcile with what I've
understood of Unicode code points. If I see four Chinese characters on
the screen between the quotation marks, the answer should be 4, right?
That is, there are 4 code points, so the string has four characters,
so the result should be 4? Or are you suggesting that this function:

   (lambda (x) (= (string-length x) (length (string->list x))))

would return #f sometimes?
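
To make the two readings concrete, here is a minimal sketch in Python
(not PLT Scheme; the sample string and the helper names are made up for
illustration). It counts the same four Chinese characters once as
Unicode code points and once as UTF-8 code units:

```python
# Illustration only: Python stands in for Scheme here, and the sample
# string and function names are hypothetical.

def count_code_points(s: str) -> int:
    # A Python str is a sequence of Unicode code points.
    return len(s)

def count_utf8_code_units(s: str) -> int:
    # Encoding to UTF-8 yields one byte per code unit.
    return len(s.encode("utf-8"))

sample = "中文字符"  # four Chinese characters, each 3 bytes in UTF-8

print(count_code_points(sample))      # 4
print(count_utf8_code_units(sample))  # 12
```

If a "char" really is a code unit, then presumably string-length and
(length (string->list x)) would both count code units, so they would
agree with each other but not with the number of characters on the
screen. That is an inference from the quoted description, not
something confirmed in this thread.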

Also, is the reason that regexps don't quite work out that the library
we use for regexps doesn't handle UTF-8 strings? Is it possible that a
newer version of the library might?
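
As a hedged illustration of how a byte-oriented matcher can misbehave
on UTF-8 data, here is a short Python sketch (again not Scheme; whether
the regexp library in question behaves this way is an assumption, not
something stated in this thread). The point is only that "." matches
one byte in byte mode but one character in character mode:

```python
import re

text = "中a"                    # one 3-byte character plus one ASCII letter
data = text.encode("utf-8")    # the same text as raw UTF-8 bytes

# Character-oriented matching: "." consumes one code point.
char_matches = re.findall(r".", text)

# Byte-oriented matching: "." consumes one byte, so the multi-byte
# character is split into three separate matches.
byte_matches = re.findall(rb".", data)

print(len(char_matches))  # 2
print(len(byte_matches))  # 4
```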

On Jan 24, 2004, at 9:01 PM, David T. Pierson wrote:
> Joel Spolsky has written a brief introductory article entitled, "The
> Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)", which might be
> helpful:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>

Thanks, David. That was quite helpful!

Robby



Posted on the users mailing list.