[plt-scheme] Why do MzScheme ports not respect the locale's encoding by default?

From: Jim Blandy (jimb at redhat.com)
Date: Fri Feb 18 16:55:10 EST 2005

Alex Shinn <foof at synthcode.com> writes:
> At 17 Feb 2005 17:17:37 -0500, Jim Blandy wrote:
> > 
> > You're right that locale-sensitivity leads to all sorts of
> > unpredictability and hidden portability issues.  But as I said, locale
> > encodings are something the user explicitly asks for.  If I understand
> > POSIX correctly, the user can always say "LC_ALL=C" and turn
> > everything off if that's what they want.
> 
> Perhaps I misunderstand, but I thought the LC_* POSIX settings were
> meant for i18n, specifically messages to and input from the user.

The locale controls a bunch of different things, including sorting and
collation rules, membership in <ctype.h> character classes (making
those functions basically useless for parsing programs, argh), message
catalogs (what you mentioned), and a bunch of other things.  One of
those things is the character encoding, which includes the multi-byte
character representation.

> That would be an argument for automatic conversion on ports connected
> to ttys, which I'm all for. It certainly wouldn't make sense to try to
> display hieroglyphs to a Latin-1 terminal.  But I don't think that
> should necessarily apply to file or network ports.

Network ports are a separate issue; I don't think those should be
translated by default.

But files and pipes, yes.  Those environment variables are defined
with the exact intent of affecting the way system utilities and
applications interpret text in files, pipes, and ttys, where the
character encoding isn't explicitly specified.  grep respects them.
The shell does.  wc does.  ("wc?  Oh, well, then.")  If the user
doesn't want that behavior, they shouldn't have set those environment
variables.  Remember, by setting those variables, the user has
*explicitly asked you* to apply those conversions to files and pipes.
That's what they mean, and what (as far as I know) they've always
meant.

> > The only way out I see for the authors of your Japanese dictionary
> > software is to write out their own code for parsing EUC-JP, and use
> > that explicitly.  But now that locales exist, programmers must
> > consider when it's appropriate to respect them and when they should be
> > ignored.  (And I wouldn't be surprised to hear of situations where
> > there's no good answer.)
> 
> The programmer should be able to open the dictionary explicitly in
> EUC-JP, then simply display the text to stdout (which would convert
> according to the user's locale).  It's simple and portable.

Hear, hear.  Unfortunately, the portable interfaces are horribly
incomplete:

- There is no way in the ISO C libraries to request a particular
  encoding.  There's bytes, and there's characters encoded according
  to the current locale, and that's it.

- You can request an encoding by name in POSIX, but the only encodings
  required to be supported are "C" and "POSIX", which amount to
  "characters are bytes".

- Using the ISO C standard I/O functions, the encodings selected are
  global to the process.  You can't read in one encoding, and write in
  another.

- The 'iconv' family of functions lets you explicitly name the
  conversion you want, and lets you have multiple independent
  conversions going on at once, but POSIX says *nothing* about the set
  of names supported.  And iconv doesn't even guarantee that all
  combinations of source and destination encodings will work: it might
  support A->B, and B->C, but not A->C, nor even C->B or B->A.

Can you even believe how lame this is?


> > First, a small correction: in msg00027.html, you mention the C I/O
> > API.  ISO C doesn't allow you to mix character and byte operations on
> > a single port.  The first operation on a port sets its orientation
> > ("byte" or "wide"), which is fixed from that point onward; operations
> > of the other orientation are an error.
> 
> Well, to be precise it doesn't allow you to mix wide character (wchar)
> and byte (char) operations.  If you stick to char operations you're
> free to mix, say, fgetc and fread.

Those are both defined as "byte operations".  Section 7.19.1, para. 5:

  The input/output functions are given the following collective terms:

  - The wide character input functions --- those functions described in 7.24
    that perform input into wide characters and wide strings: fgetwc,
    fgetws, getwc, getwchar, fwscanf, wscanf, vfwscanf, and vwscanf.

  - The wide character output functions --- those functions described in 7.24
    that perform output from wide characters and wide strings: fputwc,
    fputws, putwc, putwchar, fwprintf, wprintf, vfwprintf, and
    vwprintf.

  - The wide character input/output functions --- the union of the
    ungetwc function, the wide character input functions, and the wide
    character output functions.

  - The byte input/output functions --- those functions described in
    this subclause that perform input/output: fgetc, fgets, fprintf,
    fputc, fputs, fread, fscanf, fwrite, getc, getchar, gets, printf,
    putc, putchar, puts, scanf, ungetc, vfprintf, vfscanf, vprintf, and
    vscanf.

(I'm sorry to be quoting standards.  I don't know when I became so
reverent about all this lawyerly crap.  Thoreau would roll his eyes.)

Basically, the ISO C people realized that the original <stdio.h>
functions (printf, fread, putchar) were being used too widely to do
binary I/O, so rather than break all that code for the sake of
alternative character encodings, they explicitly defined all those
functions to operate on bytes, and introduced the wide character
functions, whose behavior was locale sensitive.

The Scheme people are coming at this the other way around, it seems to
me.  They're assuming that the existing functions --- READ,
READ-CHAR, and so on --- should all actually parse multi-byte
characters; if anyone was using them for byte I/O, to heck with them.
Instead of defining the existing functions to all be byte operations
and introducing new character operations, we're defining all the
existing functions to be character operations and introducing byte
operations.  I think it makes decent sense, given that the Scheme
community has never really embraced backward compatibility as a core
value anyway (remember #f vs. ()?).

> > However, the revised text you posted in msg00064.html is more
> > restrictive than ISO C.  In ISO C, the "orientation" of a port is
> > determined by the first I/O operation, not before.  In the revised
> > SRFI-56, it looks to me as if the port's orientation is determined
> > when it is created.  Matthew mentioned that restriction as a source
> > of troubles --- whether anticipated or actually experienced I don't
> > know.
> 
> The current compromise is meant to be as accommodating as possible to
> all systems, including Java which requires specification of the port's
> orientation at creation time.  However, the SRFI is careful to leave
> the issue of mixing of byte and character operations on the same port
> unspecified - an implementation is free to allow this.  This weakened
> stance seemed to satisfy everyone, while still allowing you to write a
> great range of portable programs using binary I/O, which is likely why
> the discussion died down.

Fair enough.

> I'd be very curious to see examples where delaying the port
> orientation until the first operation is useful though.

Me too.  As I say, I think it was Matt Flatt who mentioned it.



Posted on the users mailing list.