[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Thu Feb 17 18:53:10 EST 2005

Matthew Flatt <mflatt at cs.utah.edu> writes:
> At 17 Feb 2005 09:51:21 -0500, Jim Blandy wrote:
> > Matthew Flatt <mflatt at cs.utah.edu> writes:
> > > One source of difficulty is that MzScheme sometimes exploits the
> > > special properties of UTF-8. For example, regexp matching with strings
> > > [...]
> > 
> > That wouldn't have to change, would it?  I'm not suggesting that
> > character strings should contain anything other than Unicode.
> 
> I should have written "regexp matching on input ports with
> character-based patterns".
> 
> MzScheme's regexp matcher works on character strings, byte strings, and
> ports, with patterns specified as either byte patterns or character
> patterns. All of the combinations can be reduced to byte-pattern
> matching on ports.

I still don't see how that matters.  Right now, regexp-match applied
to an input port assumes that input port carries UTF-8 text.  If we
apply translation by default, that just makes it more likely that the
assumption holds.  Is there some reason regexp matching will behave
differently on a translated port than an untranslated port?  Is there
some way in which the translation won't be transparent?
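
To make the question concrete, here is a rough sketch in Python rather
than MzScheme, so none of the names below are MzScheme APIs.  It assumes
a character pattern is reduced to a byte pattern over UTF-8, and that
the locale's encoding is Latin-1.  The matcher is identical in both
cases; only the bytes it sees differ.

```python
import re

# Character pattern reduced to a byte pattern over UTF-8 (an assumption
# meant to mirror the reduction to byte-pattern matching on ports).
pattern = re.compile("caf\u00e9".encode("utf-8"))

# The same text as it would arrive from a Latin-1 locale (assumed).
latin1_bytes = "un caf\u00e9 noir".encode("latin-1")

# Untranslated stream: the bytes are not UTF-8, so nothing matches.
untranslated_match = pattern.search(latin1_bytes)

# Translated stream: convert from the locale's encoding to UTF-8 first,
# then run the very same byte-pattern matcher.
translated_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
translated_match = pattern.search(translated_bytes)

print(untranslated_match)   # no match on the raw Latin-1 bytes
print(translated_match)     # a match once the bytes are UTF-8
```

In this sketch the translation happens entirely before the matcher
runs, which is why I would expect it to be transparent to the matcher.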

I'm simply suggesting that translation be done by default where
under the current design one has to ask for it explicitly.  If
translation by default will cause problems, then those problems
already exist in the current design when people ask for translations
explicitly.

> Alex Shinn wrote:
> > Schemers [...] generally expect that by default
> > simple objects such as strings can be written and read back
> > consistently.  If by default ports don't support the full internal
> > encoding of MzScheme, then some strings may not be serializable.  This
> > could be handled by adding escape sequences when a character can't be
> > output in the port's locale, but that may be more complexity and
> > overhead than you want.
> 
> That's another good example.

That *is* a mess.  But is it more or less of a mess than
misinterpreting non-ASCII input when the locale's encoding isn't
UTF-8?
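
The two messes can be put side by side in a small sketch, again in
Python and again assuming a Latin-1 locale.  The backslash escape used
on output is only a stand-in for the kind of escape sequence Alex
describes, not a proposal for a specific syntax.

```python
# Mess 1: input in the locale's encoding (Latin-1 assumed) is read as
# if it were UTF-8.  The non-ASCII byte cannot be decoded and is lost.
latin1_input = "na\u00efve".encode("latin-1")
misread = latin1_input.decode("utf-8", errors="replace")

# Mess 2: output in the locale's encoding cannot represent every
# character, so some escape is needed to keep the string readable back.
# Python's "backslashreplace" handler stands in for such an escape.
escaped_output = "\u03bb".encode("latin-1", errors="backslashreplace")

print(repr(misread))
print(escaped_output)
```

In the first case the original character is gone for good; in the
second it is still recoverable by a reader that knows the escape
convention.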



Posted on the users mailing list.