[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Thu Feb 17 09:51:21 EST 2005

Matthew Flatt <mflatt at cs.utah.edu> writes:
> At 14 Feb 2005 10:13:39 -0500, Jim Blandy wrote:
> > The MzScheme manual, section 1.2.2, "Locale", says that by default,
> > input and output ports don't translate between Unicode, used
> > internally by MzScheme, and the current locale's encoding.
> >
> > Why is this?
> 
> The reasons are still not clear to me, which is why it's taken me a
> while to respond. I would really prefer to have ports use the locale's
> encoding by default, but each time I try to implement it, so many
> issues arise that it doesn't seem worthwhile.

It's nice to hear I'm not alone.  :)

> One source of difficulty is that MzScheme sometimes exploits the
> special properties of UTF-8. For example, regexp matching with strings
> is easily implemented in terms of regexp matching on bytes (i.e., a
> string pattern is easily converted to a byte pattern, where the byte
> pattern's matches are exactly the UTF-8 encodings of the string
> pattern's matches). We could do without these tricks, though at the
> expense of code complexity and a little performance.

That wouldn't have to change, would it?  I'm not suggesting that
character strings should contain anything other than Unicode.  I'm
suggesting that MzScheme should take advantage of information it
currently ignores to ensure more text is actually in UTF-8 before
regexp matching and the like take place.
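
To make the suggestion concrete, here is a minimal sketch in Python rather than MzScheme, so none of it reflects MzScheme's actual port API. The helper name open_text_port and the hard-coded Latin-1 encoding are assumptions for illustration; a real program would take the encoding from the locale (in Python, for instance, via locale.getpreferredencoding(False)).

```python
import io

def open_text_port(byte_port, encoding):
    """Hypothetical helper: wrap a byte stream so that reads yield
    Unicode text decoded from the given (locale) encoding."""
    return io.TextIOWrapper(byte_port, encoding=encoding)

# "café" as a Latin-1 locale would supply it: the é is the single byte 0xE9.
raw = "café\n".encode("latin-1")

# Treated as UTF-8 without conversion, these bytes are malformed.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# With a converting port, the program only ever sees Unicode text,
# which can then be re-encoded as well-formed UTF-8 internally.
text = open_text_port(io.BytesIO(raw), "latin-1").read()
internal = text.encode("utf-8")
```

The point of the sketch is only that the conversion happens at the port boundary, so everything past it (strings, regexp matching) can keep assuming Unicode.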

(I had thought MzScheme character strings were represented as arrays
of wide characters, but your paragraph suggests that they are
represented using UTF-8, and you are re-using your byte matching
regexp engine on character strings by transforming character regexps
into equivalent byte regexps.  Otherwise I don't understand how regexp
matching on strings could depend on the properties of UTF-8.)
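
The transformation guessed at here can be sketched as follows, again in Python and not as a description of MzScheme's implementation. The function names are hypothetical. The sketch handles only patterns whose non-ASCII characters appear as plain literals; a real translator would presumably also have to rewrite character classes, quantified non-ASCII characters, and "." so that they cover whole multi-byte sequences.

```python
import re

def match_spans_via_bytes(pattern, text):
    """Match a string pattern by running its UTF-8 encoding over the
    UTF-8 encoding of the text, then map byte offsets back to
    character offsets."""
    data = text.encode("utf-8")
    byte_re = re.compile(pattern.encode("utf-8"))
    spans = []
    for m in byte_re.finditer(data):
        # Count the characters before and inside the byte-level match.
        start = len(data[:m.start()].decode("utf-8"))
        end = start + len(data[m.start():m.end()].decode("utf-8"))
        spans.append((start, end))
    return spans

def match_spans_direct(pattern, text):
    """Reference result: match the string pattern on the string itself."""
    return [m.span() for m in re.finditer(pattern, text)]

pattern = "caf(é|e)s?"
text = "cafés and cafes, naïve café"
via_bytes = match_spans_via_bytes(pattern, text)
direct = match_spans_direct(pattern, text)
```

For this pattern the two functions agree, because the byte pattern's matches are exactly the UTF-8 encodings of the string pattern's matches. That property holds for UTF-8 in particular: no character's encoding can appear inside another's, so a byte-level match cannot start or end in the middle of a character.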

> Those are the difficulties that I can remember, and fixing all of them
> looks like a small matter of programming --- as long as it's really
> worthwhile. In the end, I don't have enough use for non-UTF-8 encodings
> myself for it to seem worthwhile, so I need other people to tell me how
> important this problem is (in both the short and long runs).

Speaking as a dumb monolingual American, I don't know how important it
is either.  To be honest, I'm actually hoping to justify *ripping out*
the conversion code that I've currently got, and leaving the
implementation of the translation to someone who actually needs it and
knows how it should work.  That is, the MzScheme approach, adjusted to
make ports convert by default, is more modular than what I've got at
the moment, making it easier to delegate to a qualified party.



Posted on the users mailing list.