[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Sat Feb 26 18:26:18 EST 2005

Alex Shinn <foof at synthcode.com> writes:
> > Did you actually check?
> 
> Checked and implemented.
> 
> > RFC 2047 <http://www.ietf.org/rfc/rfc2047.txt> is the spec for using
> > non-ASCII characters in message headers.
> 
> Actually I said "body" of the response.  Did you read the section on
> chunked encoding?  A good summary of the difficulties was referenced
> in the SRFI-56 discussion:
> 
>   http://www.haskell.org/pipermail/haskell-cafe/2004-September/006801.html

RFC 2616 section 3.6.1 defines the "Chunked Transfer Coding".  It says
that the chunk length is specified in octets.  So this is another case
where the containing protocol provides information allowing one to
identify the bytes that represent characters in some encoding, without
parsing the characters.  If you read the chunk-data as characters,
you'll mis-identify the end of the chunk and deadlock.  You need not
(or, actually, must not) read a given number of characters and then
switch back to reading bytes.
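
To make the octet-counting point concrete, here is a minimal sketch in
Python (not MzScheme; the helper name read_chunk and the sample text
are my own, purely for illustration).  It parses the chunk-size line
as ASCII and then reads exactly that many octets, never characters:

```python
import io

def read_chunk(stream):
    """Read one chunk from a binary stream (hypothetical helper).

    Returns the chunk-data as bytes, or b"" for the last chunk.
    Trailer handling is deliberately left out of this sketch.
    """
    size_line = stream.readline()                 # e.g. b"c\r\n"
    size_field = size_line.split(b";", 1)[0]      # drop any chunk-extension
    size = int(size_field.decode("ascii").strip(), 16)
    if size == 0:
        return b""
    data = stream.read(size)                      # size counts octets
    if len(data) != size:
        raise EOFError("stream ended inside chunk-data")
    if stream.read(2) != b"\r\n":
        raise ValueError("chunk-data not followed by CRLF")
    return data

# Sample payload: 10 characters, but 12 octets in UTF-8.
text = "naïve café"
data = text.encode("utf-8")
wire = (format(len(data), "x").encode("ascii") + b"\r\n"
        + data + b"\r\n"
        + b"0\r\n\r\n")

stream = io.BytesIO(wire)
first = read_chunk(stream)
last = read_chunk(stream)
```

A reader that treated the chunk-size as a character count would ask
for 12 characters here, but the chunk only holds 10, so it would run
past the end of the chunk and wait for input that belongs to the next
protocol element.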

Oleg's post argues for the necessity of layering.  The underlying
layer reads characters in ASCII to parse the chunk-size and
chunk-extension, and then reads bytes.  There's no problem there, of
course, since the protocol specifies that ASCII is being used, so we
can assume ASCII's properties hold.  The chunks are then pieced
together to form a stream possibly in some other encoding, like
iso-2022-jp; that should be handled by a layer added on top.
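
The layering described above can be sketched roughly as follows, again
in Python rather than Scheme.  The function names dechunk and
decode_layer, the sample string, and the split point are assumptions
made for illustration; the only library facility relied on is
Python's codecs.getincrementaldecoder.  The lower layer deals only in
ASCII framing and bytes, and the upper layer turns the reassembled
byte stream into characters:

```python
import codecs
import io

def dechunk(stream):
    """Lower layer: parse ASCII chunk framing, yield chunk-data as bytes."""
    while True:
        size_field = stream.readline().split(b";", 1)[0]
        size = int(size_field.decode("ascii").strip(), 16)
        if size == 0:
            return                      # last chunk; trailers ignored here
        yield stream.read(size)
        stream.read(2)                  # CRLF that follows the chunk-data

def decode_layer(byte_pieces, encoding):
    """Upper layer: decode a sequence of byte pieces as one character stream."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for piece in byte_pieces:
        decoded = decoder.decode(piece)
        if decoded:
            yield decoded
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

def frame(piece):
    """Wrap bytes as a single chunk (illustrative helper)."""
    return format(len(piece), "x").encode("ascii") + b"\r\n" + piece + b"\r\n"

message = "日本語"
payload = message.encode("iso2022_jp")
# Split in the middle of a two-byte character on purpose.
wire = frame(payload[:4]) + frame(payload[4:]) + b"0\r\n\r\n"

result = "".join(decode_layer(dechunk(io.BytesIO(wire)), "iso2022_jp"))
```

Note that the chunk boundary falls inside a character, and neither
layer needs to know or care: the lower layer never looks at the
encoding, and the upper layer never sees the chunk framing.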

And Oleg's right on the money.  In fact, I called for layering in my
earlier post:

http://list.cs.brown.edu/pipermail/plt-scheme/2005-February/007965.html

Oleg says you need it to handle complex situations.  I say you can't
even clearly define the behavior of your primitives unless they're
based on layering.

The problem with SRFI-56 is that it allows a port to satisfy both
BINARY-PORT? and CHARACTER-PORT?, but unless the character encoding
and the encoding-specific details of the conversion are also
specified, it does not define how character reads and byte reads
interact clearly enough for such ports to be usable.



Posted on the users mailing list.