[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Mon Feb 28 01:49:52 EST 2005

Alex Shinn writes:
> ISO-2022 also happens to be one of the most complicated encodings, and
> is never used for locales, only mail.

Setting aside the question of whether it's worth supporting, I don't
think ISO-2022 is out of common use.  I have on my disk recent code
from a customer, a large corporation, with comments written in
ISO-2022.  Awk scripts, shell scripts, C code, text files.  It's
recent.  I didn't go out and look for it.

> At 26 Feb 2005 17:47:59 -0500, Jim Blandy wrote:
> > The Japanese hiragana character 'A' followed by the Latin character
> > 'a' is encoded in iso-2022-jp as the following sequence of bytes (in
> > hex):
> > 
> >     #x1b #x24 #x42 #x24 #x22 #x1b #x28 #x42 #x61
> > 
> > That's:
> > 
> >     - an "ESC $ B" to get into the two-byte-per-char mode,
> >     - the two bytes #x24 and #x22, indicating the hiragana 'A',
> >     - an "ESC ( B" to get back into the single-byte ASCII mode,
> >     - the single byte #x61, indicating the Latin character 'a'.
> > 
> > In your implementation, if I read a character, and then a byte, which
> > character and byte do I get?
> 
> I'm talking about real-life scenarios here.  Why are you reading a
> character and then a byte on text that looks like that?  I think
> you're seeing ghosts where there are no practical problems.  In
> reality, even though you can have text inside binary data and binary
> data inside textual data, there's always *some* way to tell where one
> begins and the other ends, otherwise you have random data.
> 
> In this case a smart implementation might see the trailing ESC ( B and
> read through that as part of hiragana 'A', leaving the read-byte to
> return #x61.

But you agree that it's unclear exactly where the current position in
the byte stream should be left after reading a character.  That
doesn't bother you?  I just don't see how one can ever usefully do a
byte read after having done a character read in ISO-2022, regardless
of the format of the data.  And if you don't know that your encoding
isn't ISO-2022, then you should assume it might be, no?
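
To make the ambiguity concrete, here is a minimal sketch in Python
(not MzScheme, and not anything SRFI-56 specifies).  It uses Python's
standard iso2022_jp codec on the nine bytes from the example quoted
above.  The helper read_char_then_byte is hypothetical: it only
pretends that a character read consumed a given number of bytes, to
compare two plausible stopping points.

```python
# The nine bytes from the quoted example: ESC $ B, #x24 #x22, ESC ( B, #x61.
data = bytes([0x1B, 0x24, 0x42, 0x24, 0x22, 0x1B, 0x28, 0x42, 0x61])


def read_char_then_byte(stream, consumed):
    """Hypothetical helper: pretend the character read consumed
    `consumed` bytes, then return (decoded text, next byte)."""
    text = stream[:consumed].decode("iso2022_jp")
    return text, stream[consumed]


# Stop right after the two bytes that encode the hiragana.
minimal = read_char_then_byte(data, 5)

# Also read through the trailing ESC ( B, as the "smart" reader would.
greedy = read_char_then_byte(data, 8)

print(minimal[0] == greedy[0])          # same character either way?
print(hex(minimal[1]), hex(greedy[1]))  # the byte read that follows
```

Under these assumptions both stopping points should decode to the
same single character, while the byte read that follows differs: the
ESC byte in one case, #x61 in the other.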

Well, at least we're understanding each other now.

There's this great sentence from the paper in the Scheme 48
distribution about their module system that I've never forgotten:

    By forcing us to write down interfaces and module dependencies,
    the module system helps us to keep the system clean, or at least
    to keep us honest about how clean or not it is.

If you don't care about supporting encodings like ISO-2022, that's
fine, but you should "keep yourself honest" by making SRFI-56 actually
explain what sorts of encodings permit well-defined mixing.  Maybe
something like this:

  The byte input functions described in this SRFI may only be mixed
  with the character input functions on a given port if the encoding
  used by the character-based functions has the following property:

  - If a given sequence of bytes encodes a given sequence of
    characters, there is no extension of that byte sequence that
    encodes the same sequence of characters.

In other words, it's always clear where to stop.  ISO-2022 doesn't
have that property: you can append as many redundant escape sequences
(shifting out of a character set and back in) as you like without
changing the characters encoded.  All the encodings I'm presuming you
care about (ASCII, ISO-8859-foo, UTF-8, EUC-JP, Big5) do.
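
The property can be spot-checked, though not proved, with a small
Python sketch.  The function same_text_extensions and the candidate
lists are illustrative names of my own, and the UTF-8 check only
tries single-byte extensions, so treat it as a sanity check under
those assumptions rather than a demonstration of the general claim.

```python
def same_text_extensions(prefix, candidates, encoding):
    """Return the candidate suffixes that extend `prefix` without
    changing the text it decodes to."""
    text = prefix.decode(encoding)
    found = []
    for suffix in candidates:
        try:
            if (prefix + suffix).decode(encoding) == text:
                found.append(suffix)
        except UnicodeDecodeError:
            pass  # not a valid encoding at all, so not a counterexample
    return found


# UTF-8: try every possible one-byte extension of hiragana 'A'.
single_bytes = [bytes([b]) for b in range(256)]
utf8_hits = same_text_extensions(
    "\u3042".encode("utf-8"), single_bytes, "utf-8")

# ISO-2022-JP: extend ESC $ B #x24 #x22 with redundant escape sequences.
jis_prefix = b"\x1b$B$\""
shifts = [b"\x1b(B", b"\x1b(B\x1b$B", b"\x1b(B\x1b$B\x1b(B"]
jis_hits = same_text_extensions(jis_prefix, shifts, "iso2022_jp")

print(len(utf8_hits), len(jis_hits))
```

If the codecs behave as I expect, no one-byte extension of the UTF-8
bytes leaves the text unchanged, while every one of the escape-only
suffixes leaves the ISO-2022-JP text unchanged.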



Posted on the users mailing list.