[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

Mon Feb 21 17:53:50 EST 2005

Matthew Flatt <mflatt at cs.utah.edu> writes:

> At 19 Feb 2005 18:40:18 -0500, Jim Blandy wrote:
> > The reasons I know of for ignoring the current locale's encoding are
> > these:
> > [...]
> > - It's hard to implement.
> > [...]
> > - It's not worth it.
> 
> My current conclusion is a combination of these, with one more piece:
> it doesn't seem worth investing the programming effort needed to
> produce a general-purpose implementation without performance
> surprises.

Okay.

> I've added `reencode-input-port' and `reencode-output-port' to MzLib's
> "port.ss", and you can use these functions to get a port that decodes
> or encodes according to the locale's encoding. These ports still work
> with MzScheme's regexp matching, etc. --- more or less as you
> suggested.

Sounds great.

> The [proto-]ports created by `reencode-XXX-port' are somewhat heavy,
> though, and the guarantees on the ports are not yet as good as I would
> like. For example, peeking from a reencoded input port doesn't
> translate into peeks of the original input port.

Looking at last night's MzLib documentation for reencode-input-port,
it sounds like you're trying to promise some sort of relationship
between the read positions in the original port and the UTF-8 port.
You must be passing bytes to the converter one at a time to see when
it actually burps out the next re-encoded character.  Actually, I
guess you could always look at the minimum number of bytes still
needed to satisfy the request, and shove that many through, but when
you're only reading one character at a time from the reencoded port
(for read-line, say), it'll add up to the same thing: passing one byte
to the converter at a time.
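
To make the byte-at-a-time strategy concrete, here is a minimal sketch
in Python rather than Scheme. It is not MzLib's implementation; the
helper name read_char_bytewise is hypothetical, and UTF-8 stands in
for whatever the locale's encoding happens to be. The point is only
that the source position stays in step with the characters delivered
when the converter never sees more than it needs.

```python
import codecs
import io


def read_char_bytewise(stream, decoder):
    """Read one character, feeding the decoder a single byte at a time.

    Returns (char, bytes_consumed); char is '' at end of input.
    """
    consumed = 0
    while True:
        b = stream.read(1)
        if not b:
            # End of input: flush the decoder. This raises
            # UnicodeDecodeError if a sequence was cut short.
            return decoder.decode(b"", final=True), consumed
        consumed += 1
        ch = decoder.decode(b)
        if ch:
            # The converter produced a character, so stop here and
            # leave every later byte unread in the source stream.
            return ch, consumed


raw = io.BytesIO("n\u00e9\u20ac".encode("utf-8"))  # 1-, 2- and 3-byte characters
dec = codecs.getincrementaldecoder("utf-8")()

results = []
while True:
    ch, n = read_char_bytewise(raw, dec)
    if not ch:
        break
    # Record the character, the bytes it took, and the source position.
    results.append((ch, n, raw.tell()))

print(results)
```

Each tuple pairs a character with the number of bytes it consumed and
the position of the source stream afterwards, so the source never runs
ahead of the reader. The cost is one converter call per byte.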

For what it's worth, I believe this is why ISO C simply promises
nothing about being able to mix byte and wide character operations on
streams.  In your case, you've got two distinct ports, whereas ISO C
has one stream with two distinct sets of functions to apply to it, but
it's essentially the same thing.
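
As a loose analogy only (Python, not ISO C or MzScheme, and not
something taken from this thread), the sketch below layers a decoding
text stream over a byte stream. It assumes CPython's behavior, where
io.TextIOWrapper reads ahead in chunks, so after one character is read
the position of the underlying byte stream says nothing useful about
where that character ended.

```python
import io

data = "h\u00e9llo w\u00f6rld\n".encode("utf-8")
raw = io.BytesIO(data)
text = io.TextIOWrapper(raw, encoding="utf-8")

first = text.read(1)   # ask the decoding layer for a single character
raw_pos = raw.tell()   # then see how far the byte layer has advanced

print(repr(first), raw_pos, len(data))
```

With a short input like this one, the wrapper has already pulled every
byte out of the underlying stream by the time the first character
comes back, which is exactly the kind of mismatch a library avoids
having to explain if it promises nothing.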

As I said to Alex Shinn, I don't think it ends up being important to
provide those guarantees anyway, because:
- it's too hard to use them without making assumptions about the
  encoding or the converter,
- situations where you need to mix byte and character operations are
  almost always "layered", in that you can find the extent of the text
  in bytes without parsing the text into characters, and
- if you give up on the guarantees, it's easier to promise decent
  performance (given the performance of your conversion library).
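
The "layered" case in the second point can be sketched as follows,
again in Python. The wire format (a 4-byte big-endian length followed
by that many bytes of text) and the name read_text_field are
hypothetical, chosen only for illustration. Because the extent of the
text is known in bytes up front, the converter can be handed the whole
payload at once, and no promise about read positions is needed.

```python
import io
import struct


def read_text_field(stream, encoding):
    """Read a length-prefixed text field and decode it in one call."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated length header")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated text payload")
    # Byte operations found the extent; only now do characters matter.
    return payload.decode(encoding)


body = "na\u00efve caf\u00e9".encode("iso-8859-1")
raw = io.BytesIO(struct.pack(">I", len(body)) + body + b"\x00\x01binary tail")

text = read_text_field(raw, "iso-8859-1")
rest = raw.read()  # byte-level reading resumes right after the text

print(text, rest)
```

The byte layer and the character layer never interleave here: the
bytes after the text are still available to ordinary byte reads, and
the decoder is free to work in bulk.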

But it may be time to set aside the speculation and see what actual
users do with all this.