[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Alex Shinn (foof at synthcode.com)
Date: Sat Feb 26 21:24:29 EST 2005

At 26 Feb 2005 17:47:59 -0500, Jim Blandy wrote:
> 
> Alex Shinn <foof at synthcode.com> writes:
> > I'm honestly puzzled as to what could be hard to use about it.  I
> > consider the C model downright painful to use.
> 
> The Japanese hiragana character 'A' followed by the latin character
> 'a' is encoded in iso-2022-jp as the following sequence of bytes (in
> hex):
> 
>     #x1b #x24 #x42 #x24 #x22 #x1b #x28 #x42 #x61
> 
> That's:
> 
>     - an "ESC $ B" to get into the two-byte-per-char mode,
>     - the two bytes #x24 and #x22, indicating the hiragana 'A',
>     - an "ESC ( B" to get back into the single-byte ASCII mode,
>     - the single byte #x61, indicating the latin character 'a'.
> 
> In your implementation, if I read a character, and then a byte, which
> character and byte do I get?
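
For reference, the byte sequence quoted above can be reproduced with a
short script.  This is only an illustrative sketch: it uses Python and
its standard "iso2022_jp" codec, neither of which is part of the
original discussion, and the variable names are made up here.

```python
# Hiragana 'A' (U+3042) followed by Latin 'a', encoded as ISO-2022-JP.
text = "\u3042a"
encoded = text.encode("iso2022_jp")

# Print each byte in the #xNN notation used in the quoted message.
print(" ".join(f"#x{b:02x}" for b in encoded))
```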

I'm talking about real-life scenarios here.  Why are you reading a
character and then a byte on text that looks like that?  I think
you're seeing ghosts where there are no practical problems.  In
reality, even though you can have text inside binary data and binary
data inside textual data, there's always *some* way to tell where one
begins and the other ends; otherwise you have random data.

In this case a smart implementation might see the trailing "ESC ( B"
and consume it as part of reading hiragana 'A', leaving the following
read-byte to return #x61.
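
A minimal sketch of that greedy behaviour, assuming a port that keeps
one stateful decoder and swallows a trailing return-to-ASCII escape
after each character.  It is written in Python purely for illustration;
the MixedPort class and its read_char/read_byte methods are
hypothetical and are not MzScheme's API.

```python
import codecs
import io

TO_ASCII = b"\x1b(B"  # ISO-2022-JP escape that switches back to ASCII


class MixedPort:
    """Hypothetical port that supports both read_char and read_byte."""

    def __init__(self, data):
        self.buf = io.BytesIO(data)
        # Incremental decoder remembers the current shift state.
        self.decoder = codecs.getincrementaldecoder("iso2022_jp")()

    def read_byte(self):
        b = self.buf.read(1)
        return b[0] if b else None

    def read_char(self):
        # Feed one byte at a time until the decoder yields a character.
        while True:
            b = self.buf.read(1)
            if not b:
                return None
            ch = self.decoder.decode(b)
            if ch:
                break
        # Greedily consume a trailing "ESC ( B" so the next read_byte
        # sees plain ASCII instead of the start of an escape sequence.
        pos = self.buf.tell()
        if self.buf.read(3) == TO_ASCII:
            self.decoder.decode(TO_ASCII)  # keep decoder state in sync
        else:
            self.buf.seek(pos)
        return ch
```

With the nine bytes from the example as input, read_char should return
hiragana 'A' and the following read_byte should return #x61.  The
sketch only handles the return-to-ASCII escape; a real implementation
would need a policy for the other escape sequences as well.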

ISO-2022 also happens to be one of the most complicated encodings, and
is never used for locales, only mail.

-- 
Alex



Posted on the users mailing list.