[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Alex Shinn (foof at synthcode.com)
Date: Wed Feb 16 22:10:43 EST 2005

[Commenting as a semi-outsider here, but I find the issues interesting
and want to see as much consistency wrt Unicode as possible in Schemes.]

At Wed, 16 Feb 2005 10:08:32 -0700, Matthew Flatt wrote:
> 
> One source of difficulty is that MzScheme sometimes exploits the
> special properties of UTF-8.

UTF-8 indeed rocks :) Is this not a property of the internal strings
though, post-conversion?  Or are you considering switching MzScheme's
internal encoding according to the user's locale?

> A second source of difficulty is how to keep source code portable. I
> think your suggestion to handle this in `load' (always go into UTF-8
> mode) would solve that problem.

For this Gauche has followed the Emacs (and Python) approach - it
recognizes a -*- coding: FOO -*- on the first or second line, and
defaults to UTF-8 otherwise.
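
To make the lookup concrete, here is a rough sketch in Python (not
Gauche's actual code) of the "first or second line, else UTF-8" rule.
The helper name and the regular expression are my own assumptions; a
real implementation may accept more spellings of the declaration.

```python
import re

# Hypothetical pattern for an Emacs-style "-*- coding: FOO -*-" declaration.
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([-\w.]+).*?-\*-")

def detect_source_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding declared on line 1 or 2, else the default."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default

declared = detect_source_encoding(b";; -*- coding: euc-jp -*-\n(display 1)\n")
undeclared = detect_source_encoding(b"(display 1)\n")
```

Because the declaration is only looked for in the first two lines, a
declaration further down the file is ignored and the default applies.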

However, you do still have portability issues with I/O on non-source
files.  If a user relies on their default locale when creating/reading
data files, then by default their code won't work in other users'
locales.  For instance, most free Japanese dictionaries use the EUC-JP
encoding, typically used on Unix systems, but software that reads them
using the locale default will break on Windows systems, which tend to
default to SJIS.  The author can always explicitly specify the
encoding, but it's probably better to require this and not allow a
lazy default that will lead to non-portable code and potentially
hard-to-track bugs.
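
A small Python illustration of the failure mode, with made-up helper
names.  The same bytes round-trip when the encoding is named
explicitly, and come back as something else when a reader silently
assumes a different default:

```python
def write_entries(entries, encoding):
    """Serialize dictionary entries, one per line, in a named encoding."""
    return "\n".join(entries).encode(encoding)

def read_entries(data, encoding):
    """Read entries back; the caller must name the encoding."""
    return data.decode(encoding).split("\n")

entries = ["辞書", "日本語"]
data = write_entries(entries, "euc_jp")

# Explicit encoding: portable regardless of the reader's locale.
same = read_entries(data, "euc_jp")

# What a reader whose default is Shift_JIS would see for the same bytes.
# errors="replace" is used only so the mismatch is visible instead of raising.
garbled = data.decode("shift_jis", errors="replace").split("\n")
```

Here `same` equals the original list and `garbled` does not, which is
exactly the bug that only shows up on somebody else's machine.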

This, however, would not be an issue with ports connected to a tty.

Also, Schemes like to emphasize the ability to serialize arbitrary
values to files, so much so that some implementations even support
serialization of closures and continuations.  Although Schemers may
not often use this feature, they generally expect that by default
simple objects such as strings can be written and read back
consistently.  If by default ports don't support the full internal
encoding of MzScheme, then some strings may not be serializable.  This
could be handled by adding escape sequences when a character can't be
output in the port's locale, but that may be more complexity and
overhead than you want.
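
As a sketch of what the escape-sequence route involves (in Python, not
MzScheme, and assuming an ASCII-compatible target encoding), the
writer has to escape both unencodable characters and literal
backslashes, and the reader has to undo both.  The function names are
hypothetical; the `backslashreplace` error handler is standard Python.

```python
import re

def write_string(s: str, encoding: str) -> bytes:
    """Encode s, escaping anything the target encoding cannot represent."""
    # Double literal backslashes first so they cannot be mistaken for escapes.
    escaped = s.replace("\\", "\\\\")
    # backslashreplace emits \xNN, \uNNNN or \UNNNNNNNN for unencodable chars.
    return escaped.encode(encoding, errors="backslashreplace")

_ESCAPE = re.compile(r"\\(\\|x[0-9a-f]{2}|u[0-9a-f]{4}|U[0-9a-f]{8})")

def read_string(data: bytes, encoding: str) -> str:
    """Invert write_string: decode, then expand the escapes."""
    def unescape(match):
        body = match.group(1)
        if body == "\\":
            return "\\"
        return chr(int(body[1:], 16))
    return _ESCAPE.sub(unescape, data.decode(encoding))

original = "naïve λ 日本 \\u0041"
restored = read_string(write_string(original, "latin-1"), "latin-1")
```

Even this small version needs two passes and an escaping rule for the
escape character itself, which is the kind of overhead I mean.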

> A third source of difficulty is also one that you point out: mixing
> character and byte operations, and setting the byte-to-character
> conversion at the right time.

It partly depends on what you mean by mixing character and byte
operations.  One idea is that when you write a byte it should be
considered as part of a character, which raises the question of what
should happen if you have an incomplete character in the port and
then perform I/O at the full character level.  Should you skip the
invalid byte sequence or realign?
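
To illustrate the incomplete-character case, here is a Python sketch
using the standard incremental UTF-8 decoder.  The helper name is
mine, and the two error policies shown are just Python's built-in
"replace" and "ignore" handlers, not a claim about what MzScheme does
or should do.

```python
import codecs

def decode_chunks(chunks, errors="strict"):
    """Decode UTF-8 byte chunks as they arrive, buffering partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # Flush at end of input; a dangling partial character is handled here.
    pieces.append(decoder.decode(b"", final=True))
    return pieces

e_acute = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

# A character split across two writes is simply buffered and completed.
split_ok = decode_chunks([e_acute[:1], e_acute[1:]])

# A lone lead byte followed by character-level output: mark it and realign...
realigned = decode_chunks([e_acute[:1], b"abc"], errors="replace")

# ...or silently skip the invalid byte.
skipped = decode_chunks([e_acute[:1], b"abc"], errors="ignore")
```

With "replace" the stray byte becomes U+FFFD before "abc"; with
"ignore" it vanishes.  Under the strict policy the same input raises
an error instead.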

Either way I think this is unintuitive to the user.  Many formats
intermix genuine binary data with encoded text (possibly in multiple
encodings), and users expect that when they perform a binary operation
on the port they get exactly that binary effect, independent of the
port's text encoding.  In this case automatic (or explicit) conversion
of ports is orthogonal to binary I/O.
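
A sketch of the kind of format I have in mind, in Python.  The record
layout (two big-endian lengths, a raw blob, then an encoded label) is
invented for the example.  The point is that the byte-level reads and
writes never pass through a text encoding, and the text parts can each
use a different one.

```python
import io
import struct

def write_record(port, blob: bytes, label: str, encoding: str) -> None:
    """Write raw binary data followed by text in the given encoding."""
    text = label.encode(encoding)
    port.write(struct.pack(">II", len(blob), len(text)))  # two 4-byte lengths
    port.write(blob)   # binary: written exactly as given
    port.write(text)   # text: converted explicitly, just before writing

def read_record(port, encoding: str):
    """Read one record back; only the label is decoded."""
    blob_len, text_len = struct.unpack(">II", port.read(8))
    blob = port.read(blob_len)
    label = port.read(text_len).decode(encoding)
    return blob, label

port = io.BytesIO()
raw = bytes([0xFF, 0xFE, 0x00, 0xC3])  # not valid UTF-8, and need not be
write_record(port, raw, "辞書", "euc_jp")
write_record(port, raw, "dictionnaire é", "latin-1")

port.seek(0)
first = read_record(port, "euc_jp")
second = read_record(port, "latin-1")
```

Both records return the blob byte for byte, whatever encoding the
neighbouring text uses.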

-- 
Alex



Posted on the users mailing list.