[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?
At 26 Feb 2005 17:47:59 -0500, Jim Blandy wrote:
>
> Alex Shinn <foof at synthcode.com> writes:
> > I'm honestly puzzled as to what could be hard to use about it. I
> > consider the C model downright painful to use.
>
> The Japanese hiragana character 'A' followed by the latin character
> 'a' is encoded in iso-2022-jp as the following sequence of bytes (in
> hex):
>
> #x1b #x24 #x42 #x24 #x22 #x1b #x28 #x42 #x61
>
> That's:
>
> - an "ESC $ B" to get into the two-byte-per-char mode,
> - the two bytes #x24 and #x22, indicating the hiragana 'A',
> - an "ESC ( B" to get back into the single-byte ASCII mode,
> - the single byte #x61, indicating the latin character 'a'.
>
> In your implementation, if I read a character, and then a byte, which
> character and byte do I get?

I'm talking about real-life scenarios here. Why are you reading a
character and then a byte on text that looks like that? I think
you're seeing ghosts where there are no practical problems. In
reality, even though you can have text inside binary data and binary
data inside textual data, there's always *some* way to tell where one
begins and the other ends, otherwise you have random data.

In this case a smart implementation might see the trailing ESC ( B and
read through that as part of hiragana 'A', leaving the read-byte to
return #x61.
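
To make that concrete, here is a rough sketch in Python (not MzScheme,
and not how any existing port implementation is known to work). It
assumes only the two modes from the example above, ASCII and JIS X
0208; EagerReader and its methods are hypothetical names made up for
illustration. The idea is simply that read_char consumes any shift
sequence that follows the character it returns, so a later read_byte
never lands in the middle of an escape.

```python
import io

TO_JIS = b"\x1b$B"    # ESC $ B: enter two-byte JIS X 0208 mode
TO_ASCII = b"\x1b(B"  # ESC ( B: return to single-byte ASCII mode


class EagerReader:
    """Hypothetical reader that eats shift sequences eagerly (sketch only)."""

    def __init__(self, data):
        self.buf = io.BytesIO(data)
        self.two_byte = False  # current shift state, ASCII to start

    def _skip_escapes(self):
        # Consume every mode-switching escape at the current position.
        while True:
            pos = self.buf.tell()
            seq = self.buf.read(3)
            if seq == TO_JIS:
                self.two_byte = True
            elif seq == TO_ASCII:
                self.two_byte = False
            else:
                self.buf.seek(pos)  # not a known escape: put the bytes back
                return

    def read_char(self):
        self._skip_escapes()
        raw = self.buf.read(2 if self.two_byte else 1)
        if not raw:
            return None  # end of input
        if self.two_byte:
            # Wrap the pair in shifts so the stock codec can decode it alone.
            ch = (TO_JIS + raw + TO_ASCII).decode("iso2022_jp")
        else:
            ch = raw.decode("ascii")
        self._skip_escapes()  # read through a trailing ESC ( B, if any
        return ch

    def read_byte(self):
        b = self.buf.read(1)
        return b[0] if b else None


data = "\u3042a".encode("iso2022_jp")  # hiragana 'A' followed by latin 'a'
assert data == bytes([0x1B, 0x24, 0x42, 0x24, 0x22, 0x1B, 0x28, 0x42, 0x61])

r = EagerReader(data)
first = r.read_char()   # the hiragana 'A', U+3042
second = r.read_byte()  # the raw byte for latin 'a'
```

With this approach the character read returns the hiragana 'A' and the
byte read returns #x61. A truncated two-byte sequence would raise a
decoding error here; a real implementation would have to decide what
to do in that case.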

ISO-2022 also happens to be one of the most complicated encodings, and
is never used for locales, only mail.
--
Alex