[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Thu Feb 17 10:35:49 EST 2005

At Wed, 16 Feb 2005 21:10:43 -0600, Alex Shinn wrote:
> At Wed, 16 Feb 2005 10:08:32 -0700, Matthew Flatt wrote:
> > 
> > One source of difficulty is that MzScheme sometimes exploits the
> > special properties of UTF-8.
> 
> UTF-8 indeed rocks :) Is this not a property of the internal strings
> though, post-conversion?

Not entirely...

At 17 Feb 2005 09:51:21 -0500, Jim Blandy wrote:
> Matthew Flatt <mflatt at cs.utah.edu> writes:
> > One source of difficulty is that MzScheme sometimes exploits the
> > special properties of UTF-8. For example, regexp matching with strings
> > [...]
> 
> That wouldn't have to change, would it?  I'm not suggesting that
> character strings should contain anything other than Unicode.

I should have written "regexp matching on input ports with
character-based patterns".

MzScheme's regexp matcher works on character strings, byte strings, and
ports, with patterns specified as either byte patterns or character
patterns. All of the combinations can be reduced to byte-pattern
matching on ports.
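
As a rough illustration of that reduction (in Python rather than Scheme, and not MzScheme's actual matcher), a literal character pattern can be matched by encoding both the pattern and the text as UTF-8 and searching the bytes. The function name `match_literal_via_bytes` is hypothetical, and the sketch assumes literal patterns only; character classes and `.` would need a real pattern translation. It relies on the UTF-8 property that the encoding of a character sequence can only occur at a character boundary in valid UTF-8 data.

```python
import re


def match_literal_via_bytes(literal: str, text: str):
    """Find a literal character pattern by searching UTF-8 bytes.

    Returns (start, end) in character offsets, or None if there is no match.
    Hypothetical helper for illustration; handles literal patterns only.
    """
    data = text.encode("utf-8")
    # Turn the character pattern into a byte pattern over its UTF-8 encoding.
    byte_pattern = re.compile(re.escape(literal.encode("utf-8")))
    m = byte_pattern.search(data)
    if m is None:
        return None
    # A match on valid UTF-8 data starts on a character boundary,
    # so the prefix before the match decodes cleanly.
    start = len(data[:m.start()].decode("utf-8"))
    end = start + len(m.group().decode("utf-8"))
    return (start, end)
```

For example, `match_literal_via_bytes("λ", "aλb")` searches the bytes `a CE BB b` and reports the match in character offsets.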


Regexp matching is one example of how UTF-8 is built into ports.
Another example is line and column counting. And then there's the
implementation of `peek-char' (with lookahead) in terms of `peek-byte'.
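
To make the last point concrete, here is a minimal sketch (in Python, not MzScheme's source) of a character peek built on a byte peek with lookahead. `BytePort`, `utf8_length`, and `peek_char` are hypothetical names for this illustration, and returning U+FFFD for a malformed sequence is an assumption of the sketch, not a statement about what MzScheme does.

```python
class BytePort:
    """Hypothetical stand-in for a byte input port (not MzScheme's API)."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def peek_byte(self, skip: int = 0):
        """Return the byte `skip` positions ahead without consuming it, or None at EOF."""
        i = self._pos + skip
        return self._data[i] if i < len(self._data) else None

    def read_byte(self):
        """Consume and return the next byte, or None at EOF."""
        b = self.peek_byte()
        if b is not None:
            self._pos += 1
        return b


def utf8_length(lead: int) -> int:
    """Length of the UTF-8 sequence that starts with `lead`; 0 if `lead` cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def peek_char(port: BytePort):
    """Decode the next character using only peek_byte; nothing is consumed."""
    lead = port.peek_byte(0)
    if lead is None:
        return None  # end of input
    n = utf8_length(lead)
    if n == 0:
        return "\ufffd"  # assumption: replacement character for a bad lead byte
    raw = bytearray([lead])
    for k in range(1, n):
        b = port.peek_byte(k)  # lookahead past the lead byte
        if b is None or (b & 0xC0) != 0x80:
            return "\ufffd"  # truncated or malformed sequence
        raw.append(b)
    try:
        return bytes(raw).decode("utf-8")
    except UnicodeDecodeError:
        return "\ufffd"  # e.g. overlong or surrogate encodings
```

The lookahead depth depends on the lead byte, which is why the byte-level peek needs a skip argument. A port in some other encoding would need a converter in front of this logic instead of reading its bytes directly.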


Alex Shinn wrote:
> Schemers [...] generally expect that by default
> simple objects such as strings can be written and read back
> consistently.  If by default ports don't support the full internal
> encoding of MzScheme, then some strings may not be serializable.  This
> could be handled by adding escape sequences when a character can't be
> output in the port's locale, but that may be more complexity and
> overhead than you want.

That's another good example.
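
The escape-sequence idea in the quoted text could look roughly like the following sketch (Python, for illustration only). It assumes a Latin-1 port encoding and a backslash-based escape syntax; the function names are hypothetical and nothing here describes MzScheme's printer or reader.

```python
def write_escaped(s: str) -> bytes:
    """Encode `s` for a Latin-1 port, escaping characters the encoding cannot represent."""
    # Double literal backslashes first so they cannot be confused with escapes.
    protected = s.replace("\\", "\\\\")
    # Unencodable characters become \uXXXX or \UXXXXXXXX sequences.
    return protected.encode("latin-1", errors="backslashreplace")


def read_escaped(data: bytes) -> str:
    """Inverse of write_escaped: decode Latin-1 bytes and expand the escapes."""
    return data.decode("unicode_escape")
```

With this pair, a string such as `"café λ"` is written as Latin-1 bytes with `λ` replaced by an escape, and reading the bytes back yields the original string. The cost is the extra pass over every string, which is the overhead the quoted text mentions.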


Jim Blandy wrote:
> (I had thought MzScheme character strings were represented as arrays
> of wide characters

Yes, they are.


Matthew



Posted on the users mailing list.