[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Mon Feb 28 02:09:08 EST 2005

Alex Shinn <foof at synthcode.com> writes:
> At 26 Feb 2005 18:26:18 -0500, Jim Blandy wrote:
> > 
> > RFC 2616 section 3.6.1 defines the "Chunked Transfer Coding".  It says
> > that the chunk length is specified in octets.  So this is another case
> > where the containing protocol provides information allowing one to
> > identify the bytes that represent characters in some encoding, without
> > parsing the characters.
> 
> The length is a hex string, the chunk data itself is binary.
> A simplified decoder might look like:
> 
>   (define (read-chunked-data port)
>     (let lp ((res '()))
>       (let ((line (read-line port)))
>         (if (eof-object? line)
>           (block-concatenate-reverse res)
>           (lp (cons (read-block (string->number line 16) port) res))))))

Let me be picky, although it's not really germane: the HTTP RFC says
chunk-size and chunk-extension (what you're reading with that call to
'read-line') use US-ASCII.  So you're assuming that the port's
character encoding is always a superset of ASCII, which R5RS does not
promise.  No, I don't care about EBCDIC either.  But I do wish we had
an interface that let you actually say what you meant, and actually
specify ASCII, instead of presuming it.
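
As a rough sketch of what "saying what you mean" could look like: the
code below reads the chunk-size line as bytes and decodes it explicitly
as US-ASCII, so it never depends on the port's default encoding.  It
assumes present-day Racket (MzScheme's successor), and the helper names
ascii-bytes->string and read-chunk-size are hypothetical, not part of
any SRFI or library.

```scheme
#lang racket/base

;; ascii-bytes->string : bytes -> string   (hypothetical helper)
;; Accepts only US-ASCII: any byte above 127 is an error.
(define (ascii-bytes->string bs)
  (for ([b (in-bytes bs)])
    (when (> b 127)
      (error 'ascii-bytes->string "non-ASCII byte: ~a" b)))
  ;; Latin-1 maps each byte to the code point of the same value, so on
  ;; bytes below 128 it is exactly an ASCII decode.
  (bytes->string/latin-1 bs))

;; read-chunk-size : input-port -> exact integer or #f   (hypothetical)
;; Reads one CRLF-terminated line as bytes and parses the leading hex
;; digits.  Anything after them (such as a chunk-extension) is ignored.
;; Returns #f at end of input or if the line has no leading hex digits.
(define (read-chunk-size in)
  (define line (read-bytes-line in 'return-linefeed))
  (cond
    [(eof-object? line) #f]
    [else
     (define text (ascii-bytes->string line))
     (define m (regexp-match #rx"^[0-9a-fA-F]+" text))
     (and m (string->number (car m) 16))]))
```

For example, (read-chunk-size (open-input-bytes #"1a;name=value\r\n"))
yields 26, whatever the locale's encoding happens to be.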

> To be clear, are you advocating the use of two disjoint string types
> in Scheme as in C?

Among other things, I'm advocating distinguishing byte vectors and
strings, as MzScheme does.  Is that what you mean?
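
A small illustration of the distinction, written for present-day Racket
(an assumption; the names s and b are only for the example): a string
counts characters, a byte string counts octets, and the two are related
only through a named encoding.

```scheme
#lang racket/base

(define s "caf\u00E9")               ; a string of four characters
(define b (string->bytes/utf-8 s))   ; its UTF-8 encoding as a byte string

(string-length s)                          ; 4 characters
(bytes-length b)                           ; 5 octets: e-acute takes two in UTF-8
(bytes-length (string->bytes/latin-1 s))   ; 4 octets: one per character in Latin-1
(bytes->string/utf-8 b)                    ; decoding with the same encoding round-trips
```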

> Layering and/or procedural ports is a nice approach, and one I would
> like to see in a future SRFI, but is higher-level than I wanted to get
> into with SRFI-56.  Given SRFI-56 you can implement layering (and
> given layering you can implement SRFI-56) so it seems logical to start
> with the lowest-level approach first.  People are then free to come up
> with competing layering approaches.

I think you should do them both in the same SRFI.  The behavior
described in the present SRFI-56:

- is not well-defined when using encodings like ISO-2022, which are
  unpleasant in many ways but, unfortunately, still in use, and

- forces implementations that use iconv to behave in a really
  inefficient way.  They have to feed iconv input one byte at a time,
  because they can't know whether the next read will be a byte read,
  and so must not consume any bytes past the current character.

Both of these problems go away if you provide layering functions.
There's no longer any need to have byte reads behave in any particular
way following character reads.  There's no need to restrict yourself
to "reasonable" encodings.  The behavior of every operation can be
specified completely and straightforwardly.  In real-life scenarios, you
always have enough information to do the layering.  And you provide
functions we all know people want anyway.
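
A minimal sketch of the layering idea, assuming present-day Racket and
its byte converters (which typically sit on top of iconv).  The helper
names decode-slice and read-text-field are hypothetical.  The byte
layer uses a length supplied by the containing protocol to pick out the
octets; the character layer then hands that whole slice to the
converter at once, so nothing needs to be fed one byte at a time.

```scheme
#lang racket/base

;; decode-slice : bytes string -> string   (hypothetical helper)
;; Character layer: decode a complete slice of octets in one call.
(define (decode-slice raw encoding)
  (define conv (bytes-open-converter encoding "UTF-8"))
  (unless conv
    (error 'decode-slice "no converter available for ~a" encoding))
  (define-values (out used status) (bytes-convert conv raw))
  (bytes-close-converter conv)
  (unless (eq? status 'complete)
    (error 'decode-slice "malformed or truncated ~a input" encoding))
  (bytes->string/utf-8 out))

;; read-text-field : integer input-port string -> string   (hypothetical)
;; Byte layer: the protocol says the next n octets are text.
(define (read-text-field n in encoding)
  (define raw (read-bytes n in))
  (when (or (eof-object? raw) (< (bytes-length raw) n))
    (error 'read-text-field "port ended before ~a octets were read" n))
  (decode-slice raw encoding))

;; Usage: five octets of UTF-8 text followed by raw binary data.
(define in
  (open-input-bytes
   (bytes-append (string->bytes/utf-8 "caf\u00E9") (bytes 255 0))))
(read-text-field 5 in "UTF-8")   ; the text, decoded as one slice
(read-byte in)                   ; 255: the port sits exactly after the text
```

Which encodings other than UTF-8 are available depends on the
platform's converter, which is why decode-slice checks for a missing
converter instead of assuming one exists.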



Posted on the users mailing list.