[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?
The MzScheme manual, section 1.2.2, "Locale", says that by default,
input and output ports don't translate between Unicode, used
internally by MzScheme, and the current locale's encoding.
Why is this? After all, the locale is chosen by the user. The system
administrator can set a default, but the user can override that. It
seems as if MzScheme is ignoring the user's stated preference.
Translating by default raises some questions, but I think they do have
good answers:
- Sometimes you need to mix byte and character operations.
Translating by default makes that impossible.
Mixing byte and character operations is only well-defined if you
actually know something about the encoding when you write the code.
You can't know how your byte and character operations will interact
otherwise. Code which knows that much about the text has enough
knowledge to override the default translation and do something more
sophisticated. I'd think most code won't be doing something that
hairy.
And in practical terms, if you're going to use iconv as your
conversion engine, the iconv interface makes it pretty difficult to
switch between the translated character stream and the underlying
byte stream. A call to iconv reports only how far it advanced through
its input and output buffers; it gives no indication as to which bytes
correspond to which characters. So you're doomed anyway if you're
trying to do this without additional information.
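To make the iconv point concrete, here is a minimal sketch in C. The helper name convert_counts is made up for illustration, and it assumes the local iconv accepts the encoding names "UTF-8" and "UCS-4LE" (glibc does; other implementations may spell them differently). After a single call, all the caller can learn is the total number of input bytes consumed and the total amount of output produced:

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: converts a short UTF-8
 * string to UCS-4 with a single iconv() call and reports the totals.
 * Assumes the input decodes to fewer than 64 characters and that the
 * encoding names below are accepted by the local iconv.
 * Returns 0 on success, -1 on failure.
 */
int convert_counts(const char *utf8, size_t *bytes_consumed,
                   size_t *chars_produced)
{
    iconv_t cd = iconv_open("UCS-4LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char out[256];
    char *inp = (char *)utf8;       /* iconv() wants a non-const pointer */
    size_t total = strlen(utf8);
    size_t inleft = total;
    char *outp = out;
    size_t outleft = sizeof out;

    /* iconv() advances inp/outp and decrements inleft/outleft.
       Those four values are the only position information it returns. */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *bytes_consumed = total - inleft;
    *chars_produced = (sizeof out - outleft) / 4;   /* 4 bytes per UCS-4 char */
    return 0;
}
```

Feeding it the six bytes of "a", U+00E9 and U+20AC in UTF-8 should report 6 bytes consumed and 3 characters produced, with nothing to say that the second character started at byte 1 and the third at byte 3.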
- Even when you want to just read bytes, you may not know that until
after the port has been opened.
The ISO C standard's solution to this seems decent. At first, stdio
streams are "without orientation", and the first I/O operation on
the stream makes the stream either "byte-oriented" or
"wide-oriented". Once a stream has an orientation, it can't be
changed, short of reopening the stream with freopen.
In other words, you can't mix byte and character operations on a
stream, but you don't have to choose the orientation when you create
the port; you can put off the choice as long as possible.
(In the GNU C library, each stream has separate byte and wide
character buffers, both initially empty, so the first input
operation always causes a buffer underflow. The wide and byte input
operations call separate underflow functions, which check for
attempts to mix orientations and set the stream's initial
orientation.)
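The orientation rule can be observed from ordinary C code with fwide, which queries a stream's orientation when passed a mode of 0. The following is only a sketch; probe_orientation is a name invented for this example, and it uses a scratch file from tmpfile so it doesn't touch any real stream:

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper, for illustration only: opens a scratch stream
 * and records its orientation before any I/O, after the first write,
 * and after an attempt to switch to the opposite orientation.
 * fwide() results: 0 = no orientation, < 0 = byte-oriented,
 * > 0 = wide-oriented.
 * Returns 0 on success, -1 if the stream can't be created or written.
 */
int probe_orientation(int wide_first, int *before, int *after,
                      int *after_switch)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;

    /* Mode 0 only queries; it never sets an orientation. */
    *before = fwide(fp, 0);

    /* The first I/O operation fixes the orientation. */
    int ok = wide_first ? (fputwc(L'a', fp) != WEOF)
                        : (fputc('a', fp) != EOF);
    *after = fwide(fp, 0);

    /* Asking for the opposite orientation has no effect on a stream
       that is already oriented; fwide() reports the existing one. */
    *after_switch = fwide(fp, wide_first ? -1 : 1);

    fclose(fp);
    return ok ? 0 : -1;
}
```

Called with wide_first set to 0, this should report no orientation before the write and a byte orientation afterwards, even after the request to go wide; with wide_first set to 1 the signs are reversed.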
MzScheme's 'load' needs to mix byte and character operations to
support the #~ syntax, but that's fine: just stipulate that MzScheme
source files always use UTF-8, and that load doesn't use the current
locale's encoding. That's effectively what MzScheme does now.
Okay, I'll try to stop guessing at the answer to my question now.