[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Wed Feb 16 12:08:32 EST 2005

At 14 Feb 2005 10:13:39 -0500, Jim Blandy wrote:
> The MzScheme manual, section 1.2.2, "Locale", says that by default,
> input and output ports don't translate between Unicode, used
> internally by MzScheme, and the current locale's encoding.
>
> Why is this?

The reasons are still not clear to me, which is why it's taken me a
while to respond. I would really prefer to have ports use the locale's
encoding by default, but each time I try to implement it, so many
issues arise that it doesn't seem worthwhile.


One source of difficulty is that MzScheme sometimes exploits the
special properties of UTF-8. For example, regexp matching with strings
is easily implemented in terms of regexp matching on bytes (i.e., a
string pattern is easily converted to a byte pattern, where the byte
pattern's matches are exactly the UTF-8 encodings of the string
pattern's matches). We could do without these tricks, though at the
cost of more complex code and a little performance.
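
To make the UTF-8 property concrete, here is a minimal sketch in Python rather than MzScheme. It is not MzScheme's implementation: `to_byte_pattern` is a hypothetical helper invented for this illustration, and it only handles literals, grouping, alternation, and postfix quantifiers (no non-ASCII characters inside `[...]`, no `.` wildcard).

```python
import re


def to_byte_pattern(pattern: str) -> bytes:
    """Hypothetical helper: translate a string regexp into a byte regexp.

    ASCII characters (including regexp metacharacters) are copied as-is.
    Each non-ASCII character becomes a group around its UTF-8 bytes.
    """
    out = bytearray()
    for ch in pattern:
        encoded = ch.encode("utf-8")
        if len(encoded) == 1:
            out += encoded  # ASCII: one byte with the same meaning
        else:
            # Group the multi-byte sequence so that a following quantifier
            # applies to the whole character, not just to its last byte.
            out += b"(?:" + re.escape(encoded) + b")"
    return bytes(out)


text = "caféé au lait, café"

# Match on characters, then on the UTF-8 bytes of the same text.
string_matches = [m.group() for m in re.finditer("café+", text)]
byte_matches = [
    m.group()
    for m in re.finditer(to_byte_pattern("café+"), text.encode("utf-8"))
]

# The byte matches are exactly the UTF-8 encodings of the string matches.
assert byte_matches == [s.encode("utf-8") for s in string_matches]
```

The translation can stay this simple because UTF-8 is self-synchronizing: the encoding of one character never appears at a misaligned position inside the encoding of other characters, so a byte-level match cannot begin or end in the middle of a character.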

A second source of difficulty is how to keep source code portable. I
think your suggestion to handle this in `load' (always go into UTF-8
mode) would solve that problem.

A third source of difficulty is also one that you point out: mixing
character and byte operations, and setting the byte-to-character
conversion at the right time. This is particularly interesting when
ports allow peeking arbitrarily far into a stream. Again, as you
suggest, one reasonable solution is to split the character and byte
buffers, and to have some policy for how eagerly bytes are consumed for
the character buffer.
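
The split-buffer idea can be sketched as follows, in Python rather than MzScheme. Everything here is an assumption made for illustration: the class name `SplitBufferPort`, its method names, and the policy itself (decode one byte at a time, and refuse byte reads while decoded characters are still pending). It is one possible policy, not a description of how MzScheme ports work.

```python
import codecs
import io


class SplitBufferPort:
    """Hypothetical port with separate byte and character buffers."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw               # underlying binary stream
        self._byte_buf = bytearray()  # bytes peeked but not yet decoded
        self._char_buf = []           # characters decoded but not yet read
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def set_encoding(self, encoding):
        # Affects only bytes that have not reached the character buffer.
        # Any partial character held by the old decoder is discarded.
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def _next_byte(self):
        if self._byte_buf:
            b = bytes(self._byte_buf[:1])
            del self._byte_buf[:1]
            return b
        return self._raw.read(1)

    def peek_bytes(self, n):
        # Peeking at bytes never commits them to any encoding.
        while len(self._byte_buf) < n:
            b = self._raw.read(1)
            if not b:
                break
            self._byte_buf += b
        return bytes(self._byte_buf[:n])

    def read_byte(self):
        if self._char_buf:
            raise RuntimeError("bytes already committed to the character buffer")
        return self._next_byte()

    def peek_chars(self, n):
        # Lazy policy: consume one byte at a time and stop as soon as
        # n characters are available (or the stream ends).
        while len(self._char_buf) < n:
            b = self._next_byte()
            self._char_buf.extend(self._decoder.decode(b, final=not b))
            if not b:
                break
        return "".join(self._char_buf[:n])

    def read_char(self):
        self.peek_chars(1)
        return self._char_buf.pop(0) if self._char_buf else ""


# One byte read raw, one character in Latin-1, then one character in UTF-8.
data = b"A" + "é".encode("latin-1") + "é".encode("utf-8")
port = SplitBufferPort(io.BytesIO(data), encoding="latin-1")
peeked = port.peek_bytes(2)
first_byte = port.read_byte()
latin1_char = port.read_char()
port.set_encoding("utf-8")
utf8_char = port.read_char()
at_eof = port.read_char()
```

Decoding lazily keeps the window in which an encoding change is still possible as wide as it can be. The price is that a character peek irrevocably commits the bytes behind it, which is why `read_byte` refuses to run while decoded characters are pending.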

Those are the difficulties that I can remember, and fixing all of them
looks like a small matter of programming --- as long as it's really
worthwhile. In the end, I don't have enough use for non-UTF-8 encodings
myself for it to seem worthwhile, so I need other people to tell me how
important this problem is (in both the short and long run).


Meanwhile, there's the Java approach, where byte ports and character
ports are distinguished. I think that MzScheme's current solution is
compatible with this view, because a port can be wrapped to convert any
encoding into a UTF-8 encoding. (I keep meaning to add wrapping
functions to MzLib's "port.ss" module, but I haven't yet.)
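
As a rough illustration of such a wrapper, here is a sketch in Python rather than MzScheme. The function name `utf8_chunks` and its parameters are hypothetical and do not correspond to anything in "port.ss". It wraps a byte stream in some source encoding and yields the same text as UTF-8 bytes, so that downstream code only ever sees UTF-8.

```python
import codecs
import io


def utf8_chunks(raw, encoding, chunk_size=4096):
    """Hypothetical wrapper: read `raw` in `encoding`, yield UTF-8 bytes."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = raw.read(chunk_size)
        # The incremental decoder holds back an incomplete multi-byte
        # sequence until the rest of it arrives in a later chunk.
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text.encode("utf-8")
        if not chunk:
            break


# Deliberately tiny chunks, so multi-byte characters are split across reads.
latin1_source = io.BytesIO("naïve café".encode("latin-1"))
converted = b"".join(utf8_chunks(latin1_source, "latin-1", chunk_size=3))
assert converted == "naïve café".encode("utf-8")
```

The consumer of the wrapped stream needs no knowledge of the original encoding, which is what makes this approach compatible with a port layer that assumes UTF-8 throughout.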


Matthew



Posted on the users mailing list.