[plt-scheme] Unicode on the cheap
On Jan 24, 2004, at 7:30 PM, Matthew Flatt wrote:
> A PLT Scheme "char" is therefore a UTF-8 code unit, much like a Java
> "char" is a UTF-16 code unit (not to be confused with a "Unicode
> character", or even a "Unicode code point").
> [...]
> The most visible implication of the UTF-8 approach is that, in
> DrScheme's REPL,
>
> (string-length "$%^#")
>
> will produce a value between 4 and 20, depending on the letters in place
> of "$", "%", "^", and "#". For Chinese letters, the result will tend to
> be 12. For non-English Latin-based letters, the result will tend to be
> 4 or 5.
Those two don't quite seem to reconcile, from what I understand of
UTF-8 and Unicode code points. If I see four Chinese characters on the
screen between the quotation marks, the answer should be 4, right? That
is, there are four code points, so the string has four characters, so
the result should be 4? Or are you suggesting that this function:
(lambda (x) (= (string-length x) (length (string->list x))))
would return #f sometimes?
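
To make the two ways of counting concrete, here is a minimal sketch in
Python rather than PLT Scheme. It only illustrates code points versus
UTF-8 code units; it is not how MzScheme implements strings. The sample
strings and the helper names count_code_points and
count_utf8_code_units are made up for this illustration.

```python
# Sample strings are illustrative assumptions, not taken from the thread.
samples = {
    "ascii": "abcd",
    "latin": "caf\u00e9",                    # "cafe" with e-acute (U+00E9)
    "chinese": "\u4e2d\u6587\u5b57\u7b26",   # four CJK characters
}

def count_code_points(s: str) -> int:
    """Number of Unicode code points (what a reader sees as 'characters' here)."""
    return len(s)

def count_utf8_code_units(s: str) -> int:
    """Number of bytes (UTF-8 code units) needed to encode the string."""
    return len(s.encode("utf-8"))

for name, s in samples.items():
    print(name, count_code_points(s), count_utf8_code_units(s))
```

Each sample has four code points, but the code-unit counts come out as
4, 5, and 12, which lines up with the figures quoted above. If both
string-length and string->list count code units (an assumption about
the proposal, not something confirmed here), the two would still agree
with each other; they would simply both differ from the number of
characters on the screen.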
Also, is the reason that regexps don't quite work that the library we
use for regexps doesn't handle UTF-8 strings? Is it possible that a
newer version of the library does?
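
As a rough illustration of the kind of mismatch I have in mind, here is
a sketch in Python (not the regexp library PLT Scheme uses). It shows
how a matcher that works one byte at a time sees UTF-8 text differently
from one that works one character at a time. Whether this is the actual
cause of the regexp problem is only a guess; the sample string is made
up.

```python
import re

text = "caf\u00e9"            # 4 code points; U+00E9 takes 2 bytes in UTF-8
raw = text.encode("utf-8")    # the same text as 5 UTF-8 code units

# Character-aware matching: "." consumes one code point per match.
char_matches = re.findall(r".", text)

# Byte-oriented matching: "." consumes one byte per match, so the
# two-byte sequence for U+00E9 is split across two separate matches.
byte_matches = re.findall(rb".", raw)

print(len(char_matches), len(byte_matches))
print(byte_matches[-2:])
```

The byte-oriented matcher reports five matches instead of four, and the
last two are the halves of a single character, so patterns such as "."
or a character class containing a non-ASCII letter would not mean what
the user expects.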
On Jan 24, 2004, at 9:01 PM, David T. Pierson wrote:
> Joel Spolsky has written a brief introductory article entitled, "The
> Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)", which might be
> helpful:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
Thanks, David. That was quite helpful!
Robby