[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Mon Feb 28 02:32:54 EST 2005

Michael Sperber <sperber at informatik.uni-tuebingen.de> writes:
> Jim> But you try it.  The first word of your paragraph is "intuitively".
> Jim> Write us a nice tight description of exactly how interleaved character
> Jim> and byte read operations behave that makes no assumptions about the
> Jim> encoding beyond those made by ISO C, but allows the user to always
> Jim> predict the next byte that will be read given any sequence of input
> Jim> bytes in any character encoding, and any interleaving of byte and
> Jim> character reads.  This is a description which, if it can be written,
> Jim> ought to be in SRFI 56 anyway, so it's no waste of time either way.
> 
> I think ISO C is an unfortunate reference, as the underlying character
> representation is encoding-dependent (as I understand it, at least),
> which it isn't in PLT Scheme.  Once you fix the character semantics, a
> whole lot of problems go away.
> 
> I may have gotten confused about what the focus of this discussion is
> along the way---if so, I'd appreciate a little help :-)

You're probably not the only one.  :)

My claim is that it's impossible to precisely specify the behavior of
mixed byte and character reads on a port if the character encoding
doesn't have some restrictions imposed on it.  It can't be left
completely unspecified.  Restricting it to real-world encodings still
doesn't solve the problem.  Here's the example I gave before:

    The Japanese hiragana character 'A' followed by the Latin
    character 'a' is encoded in iso-2022-jp as the following sequence
    of bytes (in hex):

        #x1b #x24 #x42 #x24 #x22 #x1b #x28 #x42 #x61

    That's:

        - an "ESC $ B" to get into the two-byte-per-char mode,
        - the two bytes #x24 and #x22, indicating the hiragana 'A',
        - an "ESC ( B" to get back into the single-byte ASCII mode,
        - the single byte #x61, indicating the Latin character 'a'.

My concern is, after reading one character, where is the current
position in the byte stream --- before or after the "ESC ( B"?
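
To make the ambiguity concrete, here is a small sketch in Python. It
uses Python's standard "iso2022_jp" codec purely as a stand-in for
whatever converter an implementation might use; the variable names are
mine, and the byte position it reports is a property of this particular
decoder, not something the encoding requires.

```python
import codecs

# Byte sequence from the example above: ESC $ B, #x24 #x22, ESC ( B, #x61.
data = bytes([0x1B, 0x24, 0x42, 0x24, 0x22, 0x1B, 0x28, 0x42, 0x61])

# Decoding the whole buffer yields hiragana A (U+3042) followed by 'a'.
text = data.decode("iso2022_jp")

# Feed the decoder one byte at a time and note how many bytes it had
# seen when the first character came out.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
first_char_at = None
for count, byte in enumerate(data, start=1):
    if decoder.decode(bytes([byte])) and first_char_at is None:
        first_char_at = count

# Both prefixes decode to the same single character, so "one character
# has been read" is consistent with either byte position.
before_escape = data[:5].decode("iso2022_jp")
after_escape = data[:8].decode("iso2022_jp")

print(repr(text), first_char_at, before_escape == after_escape)
```

A decoder that is handed the whole buffer at once is free to swallow
the trailing "ESC ( B" along with the first character, which is exactly
why the position seen by a following byte read is not determined.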

There are lots of solutions:

1) Amend SRFI-56 to clarify that it's not well-defined to follow a
   character read with a byte read in some encodings.  I've suggested
   language that (I think) characterizes the sorts of encodings where
   this would be well-defined.

   This option removes the ambiguity, but still forces implementations
   that use iconv for their conversion to do so in a very inefficient
   way: stuffing in one byte at a time, to avoid passing bytes to iconv
   that should have been returned to the user via read-byte.

2) Amend SRFI-56 to restrict ports to be either char-only or
   byte-only.  The SRFI currently allows implementations to make this
   restriction, but as I say, if the implementation doesn't restrict
   the character encodings supported, then it's difficult to write
   code that makes no assumptions about the underlying encoding anyway.

   This option allows arbitrary encodings, and it allows iconv-based
   implementations to go fast, but the restrictions on mixed
   operations are kind of frustrating.

3) Amend SRFI-56 as in 2), but add functions to do port layering,
   which I think is the nicest way to describe the cases that have
   been mentioned (ELF files; HTTP; and so on).

   I recognize the value of keeping the SRFI's scope limited.  But I
   think these problems are more intimately related than they might
   seem.
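
As a rough illustration of what option 3 might look like, here is a
sketch in Python rather than Scheme.  The helper name
read_text_section and the length-prefixed layout are hypothetical,
made up for this example; nothing like them is proposed in SRFI-56.
The idea is only that the underlying port stays byte-only, and
character reads go through a separate layer that owns a bounded run of
bytes.

```python
import io

def read_text_section(byte_port, length, encoding):
    """Return a character-only layer over the next `length` bytes (hypothetical helper)."""
    raw = byte_port.read(length)
    if len(raw) != length:
        raise EOFError("section truncated")
    # The character layer sees only this section's bytes, so it can buffer
    # and consume escape sequences freely without disturbing byte_port.
    return io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")

# A made-up container: one length byte, an iso-2022-jp body, a binary trailer.
body = bytes([0x1B, 0x24, 0x42, 0x24, 0x22, 0x1B, 0x28, 0x42, 0x61])
trailer = b"\x00\x01"
port = io.BytesIO(bytes([len(body)]) + body + trailer)

n = port.read(1)[0]                        # byte-level read: the length prefix
chars = read_text_section(port, n, "iso2022_jp")
text = chars.read()                        # character-level reads use the layer
rest = port.read()                         # byte reads resume at a known position

print(repr(text), rest)
```

Because the layer is handed a fixed range of bytes up front, the
question of where the byte position sits after a character read never
comes up on the underlying port.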



Posted on the users mailing list.