[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Mon Feb 14 10:13:39 EST 2005

The MzScheme manual, section 1.2.2, "Locale", says that by default,
input and output ports don't translate between Unicode, used
internally by MzScheme, and the current locale's encoding.

Why is this?  After all, the locale is chosen by the user.  The system
administrator can set a default, but the user can override that.  It
seems as if MzScheme is ignoring the user's stated preference.

Translating by default raises some questions, but I think they do have
good answers:

- Sometimes you need to mix byte and character operations.
  Translating by default makes that impossible.

  Mixing byte and character operations is only well-defined if you
  actually know something about the encoding at the time you write
  the code.  Otherwise, you can't know how your byte and character
  operations will interact.  Code that knows that much about the text
  knows enough to override the default translation and do something
  more sophisticated.  I'd think most code won't be doing anything
  that hairy.

  And in practical terms, if you're going to use iconv as your
  conversion engine, the iconv interface makes it pretty difficult to
  switch between the translated character stream and the underlying
  byte stream.  iconv gives no indication as to which bytes
  correspond to which characters, so you're doomed anyway if you're
  trying to do this without additional information.

- Even when you want to just read bytes, you may not know that until
  after the port has been opened.

  The ISO C standard's solution to this seems decent.  At first, stdio
  streams are "without orientation", and the first I/O operation on
  the stream makes the stream either "byte-oriented" or
  "wide-oriented".  Once a stream has an orientation, that orientation
  can't be changed short of reopening the stream with freopen.  In
  other words, you can't mix byte and character operations on a
  stream, but you don't have to choose the orientation when you create
  the port; you can put off the choice until the first I/O operation.

  (In the GNU C library, each stream has separate byte and wide
  character buffers, both initially empty, so the first input
  operation always causes a buffer underflow.  The wide and byte input
  operations call separate underflow functions, which check for
  attempts to mix orientations and set the stream's initial
  orientation.)
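
To make the orientation idea concrete, here is a minimal sketch in
Python (not C, and not MzScheme).  The class OrientedStream and its
methods read_byte and read_char are hypothetical names invented for
this illustration; they are not part of stdio or of any MzScheme API.
Raising an error on a mixed operation is also just one possible
choice for the sketch.

```python
import io


class OrientedStream:
    """Hypothetical wrapper: the first read fixes the orientation."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # underlying binary stream
        self._encoding = encoding  # assumed encoding for character reads
        self._text = None          # built only if we become char-oriented
        self.orientation = None    # None means "without orientation"

    def _orient(self, wanted):
        # The first operation sets the orientation; later ones must match.
        if self.orientation is None:
            self.orientation = wanted
        elif self.orientation != wanted:
            raise ValueError(
                "stream is %s-oriented; cannot do a %s operation"
                % (self.orientation, wanted))

    def read_byte(self):
        self._orient("byte")
        return self._raw.read(1)

    def read_char(self):
        self._orient("char")
        if self._text is None:
            self._text = io.TextIOWrapper(self._raw,
                                          encoding=self._encoding)
        return self._text.read(1)


# Usage: nothing is decided when the stream is created.
s = OrientedStream(io.BytesIO("\u03bbx".encode("utf-8")))
first = s.read_char()          # this call makes the stream char-oriented
try:
    s.read_byte()              # mixing is rejected from here on
    mixed_ok = True
except ValueError:
    mixed_ok = False
print(repr(first), s.orientation, mixed_ok)
```

The point of the design is that the caller never declares an
orientation up front; whichever kind of read happens first settles
it.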

MzScheme's 'load' needs to mix byte and character operations to
support the #~ syntax, but that's fine: just stipulate that MzScheme
source files always use UTF-8, and that load doesn't use the current
locale's encoding.  That's effectively what MzScheme does now.
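
As a rough illustration of why a fixed, known encoding makes mixing
tractable, here is a small Python sketch.  It is not how MzScheme's
load works; the helper read_char and the sample data are assumptions
made up for the example.  It decodes one character at a time by
feeding single bytes to an incremental decoder, so the position in
the byte stream is always known when switching to byte reads.  That
only works because the code is written knowing the encoding; a fresh
decoder per character would not be safe for stateful encodings.

```python
import codecs
import io


def read_char(raw, encoding="utf-8"):
    """Read exactly one character from a binary stream.

    Bytes are fed to the decoder one at a time, so afterwards the
    stream is positioned exactly at the end of that character.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        b = raw.read(1)
        if not b:
            # Clean end of stream gives ""; a truncated character
            # raises UnicodeDecodeError.
            return decoder.decode(b"", final=True)
        ch = decoder.decode(b)
        if ch:
            return ch


# Two characters of text followed by two raw bytes (made-up data).
data = "#\u03bb".encode("utf-8") + b"\x00\xff"

# Known to be UTF-8: two characters, then the raw bytes line up.
raw = io.BytesIO(data)
utf8_chars = read_char(raw) + read_char(raw)
utf8_rest = raw.read()

# Same bytes, but decoded as Latin-1: the byte reads start elsewhere.
raw = io.BytesIO(data)
latin1_chars = read_char(raw, "latin-1") + read_char(raw, "latin-1")
latin1_rest = raw.read()

print(repr(utf8_chars), utf8_rest)
print(repr(latin1_chars), latin1_rest)
```

The same "two characters, then bytes" code leaves the byte stream at
a different position depending on the encoding, which is why the
mixing is only meaningful once the encoding is stipulated.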

Okay, I'll try to stop guessing at the answer to my question now.



Posted on the users mailing list.