[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Thu Feb 17 17:17:37 EST 2005

You're right that locale-sensitivity leads to all sorts of
unpredictability and hidden portability issues.  But as I said, locale
encodings are something the user explicitly asks for.  If I understand
POSIX correctly, the user can always say "LC_ALL=C" and turn
everything off if that's what they want.  Only the user is in a
position to know whether they're doing themselves a favor or shooting
themselves in the foot.  When we ignore their instructions, we can
only make it more likely that they get shot in the foot no matter what
they do.
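
(To be concrete about what "explicitly asks for" can mean: in
MzScheme the locale's encoding is consulted only when you name it,
for instance through the /locale conversion functions.  A sketch,
assuming a locale whose encoding is EUC-JP:

  ;; Decode and encode using whatever encoding the user's locale
  ;; names; LC_ALL, LANG, etc. are finally in charge here.
  (bytes->string/locale #"\306\374\313\334")  ; "日本" under ja_JP.eucJP
  (string->bytes/locale "abc")                ; locale-encoded bytes

Under "LC_ALL=C" the locale's encoding is plain ASCII, so the
non-ASCII bytes in the first expression would be a decoding error
instead.)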

The only way out I see for the authors of your Japanese dictionary
software is to write their own code for parsing EUC-JP, and use
that explicitly.  But now that locales exist, programmers must
consider when it's appropriate to respect them and when to ignore
them.  (And I wouldn't be surprised to hear of situations where
there's no good answer.)
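
To make that concrete: MzScheme's byte converters are one way to do
the explicit parsing.  A sketch (whether "EUC-JP" is available
depends on the platform's iconv, hence the #f check):

  (define conv (bytes-open-converter "EUC-JP" "UTF-8"))
  (when conv
    ;; #"\306\374\313\334\270\354" is "nihongo" in EUC-JP.
    (let-values ([(converted n-read status)
                  (bytes-convert conv #"\306\374\313\334\270\354")])
      ;; status is 'complete when every input byte was consumed
      (bytes-close-converter conv)
      (bytes->string/utf-8 converted)))  ; => "日本語"

No locale is consulted anywhere in that; the encoding is named in
the program text, which is exactly the point.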

Regarding mixing byte and character operations: I've looked through
the SRFI-56 archives, especially the thread started by Per Bothner
here:

http://srfi.schemers.org/srfi-56/mail-archive/msg00024.html

There are two points I'd like to make:


First, a small correction: in msg00027.html, you mention the C I/O
API.  ISO C doesn't allow you to mix character and byte operations on
a single port.  The first operation on a port sets its orientation
("byte" or "wide"), which is fixed from that point onward; operations
of the other orientation are an error.

However, the revised text you posted in msg00064.html is more
restrictive than ISO C.  In ISO C, the "orientation" of a port is
determined by the first I/O operation, not before.  In the revised
SRFI-56, it looks to me as if the port's orientation is determined
when it is created.  Matthew mentioned that restriction as a source
of troubles --- whether anticipated or actually experienced I don't
know.


Second, I still feel the discussion was glib about mixing
character and byte operations.  I don't think everyone's thinking
carefully about just how much leeway multi-byte encodings have, and
the hair that can result from that.

For example, ISO C requires that the encoding of any character
other than '\0' contain no zero bytes, and that every character in
the basic execution character set (specified to be the letters,
digits, the symbols used in the C language itself, and some
whitespace characters) be represented in a single byte.  But within
those rules, anything goes.

That permits encodings that can produce two characters from a single
byte.  (The Z-machine, used to implement Zork, uses an encoding where
this can happen.)  If you read the first of those characters, and then
read a byte, which byte do you get?
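
(For UTF-8, MzScheme can give a clean answer, because each
character occupies a whole number of bytes and read-char consumes
exactly those bytes.  A sketch:

  ;; #"\316\273" is the two-byte UTF-8 encoding of U+03BB.
  (define p (open-input-bytes #"\316\273x"))
  (read-char p)  ; => #\λ   (consumes both bytes)
  (read-byte p)  ; => 120   (the byte for #\x, immediately after)

In the Z-machine-style encoding, though, the second of the two
characters corresponds to no unconsumed byte at all, so no answer
of this shape exists.)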

Or more realistically, what if your ISO-2022 text contains a single
Chinese character, with shift-in and shift-out sequences around it?
After you've read that Chinese character, what's the next byte you
see?  The first byte of the shift-out sequence?  Or the first byte
after the shift-out sequence?  If I have a series of shift-in and
shift-out sequences that enclose no actual characters, when do I
consume those, and when do I leave them unread?
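
To pin the question down, here is one plausible byte layout for
such a stream, written as an MzScheme byte string (ISO-2022-JP; the
framing is made up, but the shift sequences are real):

  (define iso-2022-bytes
    (bytes-append #"\033$B"     ; shift in: JIS X 0208
                  #"F|"         ; 0x46 0x7C, the code for U+65E5
                  #"\033(B"))   ; shift out: back to ASCII

After a character read returns U+65E5, is the next byte 27 (the
shift-out still unread), or eof (the shift-out consumed along with
the character)?  A byte stream plus an encoding doesn't decide this
for you.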

None of these questions are unanswerable.  But each answer depends on
the specifics of the encoding at hand.  I believe that you can't
specify a single behavior across all permitted encodings, because
there's too much latitude.  You could add further restrictions on
encodings to make things manageable.  But nobody in the SRFI-56
conversation suggested that, or even mentioned issues like this, which
left me quite uncomfortable.  I think these concerns are what motivate
the restrictions imposed by the ISO C interface.

Certainly there are many binary formats that include text.  But
everything I've seen does so in a "contained" way: the enclosing
binary format gives you rules, independent of the text encoding, to
decide where the string begins and ends.  You don't need to parse the
text into characters to determine its extent.  You extract the text's
bytes according to the rules of the containing format, and then you
use the appropriate encoding to parse those bytes into characters.
There's a clear layering there.  It simply isn't reasonable to expect
generic ports to handle layered text like that in any automatic way:
only code interpreting the containing format can make the right
decisions.
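
As a sketch of that layering in MzScheme, take a hypothetical
container format in which a string is stored as a two-byte
big-endian length followed by that many bytes of UTF-8:

  (define (read-contained-string in)
    (let* ([len-bytes (read-bytes 2 in)]
           [len (integer-bytes->integer len-bytes #f #t)]
           [payload (read-bytes len in)])
      ;; The container determined the extent; only now do we
      ;; hand the bytes to the text-encoding layer.
      (bytes->string/utf-8 payload)))

  (read-contained-string
   (open-input-bytes #"\0\5hello...more binary data..."))
  ; => "hello", with the rest of the bytes left untouched

(Error handling, eof checks, and so on omitted.)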

Any other way of doing it, as far as I can tell, must run into the
ambiguities I've noted above, and end up specifying encoding-specific
rules to resolve them, or must impose restrictions on the encodings
that can be used.  Again, none of this is stuff that a port
implementation can be expected to handle automatically.


My feeling is that, since so few assumptions that might give us a
handle on the problem hold true across all encodings, the best
approach is to keep your layers strictly separated, as in protocol
stacks.  You have primitives that convert streams of bytes to
streams of characters, and vice versa; and the important thing is
to make sure you can readily use those primitives as layers in a
stack --- flexible ways to provide those streams with input and to
consume their output.
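
Here is a sketch of such a layer in MzScheme, stacking a byte
converter underneath a character port (the 'continues and error
statuses are ignored, and encoding availability depends on iconv):

  (define (bytes->char-port bstr from-encoding)
    (let ([conv (bytes-open-converter from-encoding "UTF-8")])
      (unless conv
        (error 'bytes->char-port "encoding unavailable"))
      (let-values ([(out n status) (bytes-convert conv bstr)])
        (bytes-close-converter conv)
        ;; A fresh port over the converted bytes: read-char on it
        ;; sees well-formed UTF-8, whatever the source encoding.
        (open-input-bytes out))))

  (read-string 4 (bytes->char-port #"caf\351" "ISO-8859-1"))
  ; => "café"

Each layer has one job: the converter maps bytes to bytes, and the
port on top maps bytes to characters.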

What's great about MzScheme's behavior is that it's completely clear
what happens at each layer, but it places no restrictions at all on
the encoding.  If you've got an unconverted input port, it carries
bytes directly from the outside world, which your input functions will
treat as UTF-8.  If you've got a converted input port, it carries text
converted to UTF-8, and you just don't know the relationship with the
original bytes.  To parse text contained in binary files, you follow
the format's rules for finding the extent of the text, extract it as a
byte string, and then make a port that reads from that string.  Or you
construct a custom port that reads bytes from a specific subrange of
the source.
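
The subrange approach might look like this (a sketch; the file name
and offsets are invented):

  ;; Suppose the container format says that bytes 16..47 of the
  ;; file hold UTF-8 text.
  (define in (open-input-file "data.bin"))
  (file-position in 16)
  (define text-bytes (read-bytes 32 in))
  (close-input-port in)
  (define text-port (open-input-bytes text-bytes))
  (read-char text-port)  ; character operations, safely contained

The character-level code never sees the surrounding binary data at
all.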

My only quibble is exactly when converted ports should be constructed.

I've been reading the GNU C library's source code to try to understand
how it implements fgetpos and fsetpos, but I haven't been able to dope
it out.  I must need a bigger screen.  :)


