[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Fri Feb 25 14:40:34 EST 2005

Alex Shinn <foof at synthcode.com> writes:
> At 21 Feb 2005 17:53:50 -0500, Jim Blandy wrote:
> > 
> > For what it's worth, I believe this is why ISO C simply promises
> > nothing about being able to mix byte and wide character operations on
> > streams.  In your case, you've got two distinct ports, whereas ISO C
> > has one stream with two distinct sets of functions to apply to it, but
> > it's essentially the same thing.
> > 
> > As I said to Alex Shinn, I don't think it ends up being important to
> > provide those guarantees anyway, because:
> > - it's too hard to use them without making assumptions about the
> >   encoding or the converter,
> 
> Intuitively the port has a character encoding that takes effect when
> you perform character-level operations, and is ignored when you
> perform binary operations.  It may be hard to implement, but not to
> use.

It's hard to implement, and hard to use.  In order to use a facility
properly, you need to be able to distinguish the properties it happens
to have as you do your development from the properties the designers
promise it will always have.  I've argued that there are too few
properties one can guarantee without making restrictive assumptions
about the encodings at hand.

But you try it.  The first word of your paragraph is "intuitively".
Write us a nice tight description of exactly how interleaved character
and byte read operations behave that makes no assumptions about the
encoding beyond those made by ISO C, but allows the user to always
predict the next byte that will be read given any sequence of input
bytes in any character encoding, and any interleaving of byte and
character reads.  This is a description which, if it can be written,
ought to be in SRFI 56 anyway, so it's no waste of time either way.
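
To make the difficulty concrete, here is a minimal Python sketch.
Python's codecs module stands in for whatever converter a port might
use, and the function name trace_char_boundaries is made up for
illustration.  It feeds a decoder one byte at a time and records how
many bytes had gone in when each character came out:

```python
import codecs

def trace_char_boundaries(data: bytes, encoding: str) -> list[tuple[str, int]]:
    """Return (character, bytes fed so far) for each decoded character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    trace = []
    for count, byte in enumerate(data, start=1):
        # Feed exactly one byte; the decoder may emit zero or more characters.
        for char in decoder.decode(bytes([byte])):
            trace.append((char, count))
    # Flush; this raises if the input stopped in the middle of a character.
    decoder.decode(b"", final=True)
    return trace

# A stateless encoding, then a stateful one that uses escape sequences.
print(trace_char_boundaries("aé".encode("utf-8"), "utf-8"))
print(trace_char_boundaries("あ".encode("iso2022_jp"), "iso2022_jp"))
```

With a stateful encoding such as ISO-2022-JP, the encoded form can
contain escape sequences that produce no character at all, so "the
byte after the character just read" need not be payload.  Any
specification of interleaved reads has to say what happens to those
bytes.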

> > - situations where you need to mix byte and character operations are
> >   almost always "layered", in that you can find the extent of the text
> >   in bytes without parsing the text into characters, and
> 
> Many network protocols mix byte and character data, including HTTP and
> FTP.  To read a response you need to parse in terms of lines of
> characters, and then possibly switch to binary operations for the body
> of the response.

Did you actually check?

RFC 2616 <http://www.ietf.org/rfc/rfc2616.txt> defines HTTP/1.1.
Section 2.2 defines the basic rules for HTTP headers.  Everything
there is specified to use US-ASCII, except for general text field
contents:

     The TEXT rule is only used for descriptive field contents and
     values that are not intended to be interpreted by the message
     parser. Words of *TEXT MAY contain characters from character sets
     other than ISO-8859-1 [22] only when encoded according to the
     rules of RFC 2047 [14].

RFC 2047 <http://www.ietf.org/rfc/rfc2047.txt> is the spec for using
non-ASCII characters in message headers.  It requires encoded words to
take the form:

        =?<charset>?<encoding>?<encoded-text>?=

I'm not too familiar with these specs; if I've missed something let me
know.  But it looks to me like this fits the description I gave
earlier: the lower-level protocol specifies how characters are encoded
in a way that allows one to determine the extent of the characters'
bytes before one actually parses the characters.  It carefully avoids
requiring one to parse characters using an arbitrary encoding, and
then switch to binary.
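
A rough Python sketch of that layering, assuming a header line already
read as raw bytes (decode_encoded_words is a hypothetical helper, not
part of any library, and it ignores details such as folding and
adjacent encoded words).  The delimiters are all ASCII, so the extent
of each encoded word is found by scanning bytes; the named charset is
consulted only afterwards:

```python
import base64
import quopri
import re

# RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
# Every delimiter is plain ASCII, so matching works on raw bytes.
ENCODED_WORD = re.compile(rb"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=")

def decode_encoded_words(header: bytes) -> list[str]:
    """Decode each encoded-word found in a raw header line."""
    words = []
    for match in ENCODED_WORD.finditer(header):
        charset, encoding, payload = match.groups()
        if encoding.upper() == b"B":
            raw = base64.b64decode(payload)                   # "B": base64
        else:
            raw = quopri.decodestring(payload, header=True)   # "Q": '_' means space
        # Only now, with the byte extent settled, is the charset applied.
        words.append(raw.decode(charset.decode("ascii")))
    return words

print(decode_encoded_words(b"Subject: =?ISO-8859-1?Q?Andr=E9?= =?UTF-8?B?w6k=?="))
```

At no point does the scanner decode characters in an arbitrary
encoding in order to find out where the text ends.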

I'm not going to check the FTP spec; it's your turn to do boring
reading.  :)

You mention iconv in your next message.  I love iconv: it's in POSIX,
and there's a Freely available implementation that supports tons of
conversions.  I'd like to use iconv.  But iconv's performance is lousy
if you use it in the way you're implicitly suggesting, and in the way
I gather Matt has used it in MzScheme's ports.ss: pushing one byte at
a time into the input so you can see exactly when the next character
pops out the other end --- like force-feeding an earthworm one grain
of dirt at a time.

The ISO C restrictions allow one (especially if one is the GNU C
library) to run iconv over a whole block of bytes at a time.  So you
get good performance out of a very general and simple interface.
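
The block-at-a-time style can be sketched in Python.  Here the codecs
incremental decoder stands in for iconv, and decode_in_blocks and its
block_size parameter are illustrative names, not anything from
MzScheme or the C library:

```python
import codecs
import io

def decode_in_blocks(stream, encoding: str, block_size: int = 4096) -> str:
    """Decode a binary stream by handing the converter whole blocks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # A character split across two blocks is buffered by the decoder
        # and emitted once the rest of its bytes arrive.
        parts.append(decoder.decode(block))
    parts.append(decoder.decode(b"", final=True))  # flush; raises on a truncated character
    return "".join(parts)

text = "naïve café " * 1000
data = text.encode("utf-8")
# A deliberately awkward block size, so characters straddle block boundaries.
assert decode_in_blocks(io.BytesIO(data), "utf-8", block_size=7) == text
```

The cost of this style is that the underlying byte position runs ahead
of the characters handed to the caller, which is exactly why a byte
read interleaved at that point has no obvious meaning.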



Posted on the users mailing list.