[plt-scheme] Unicode strings in mzscheme

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sun Apr 22 19:11:52 EDT 2007

Some clarifications:

At Sun, 22 Apr 2007 14:47:09 -0400, Richard Cobbe wrote:
> I suspect that this is an issue not with PLT's support for Unicode strings,
> but rather with Unicode I/O.  In the case of mzscheme, AFAICT, console I/O
> relies heavily on the Unicode capabilities of the console in which MzScheme
> is running.

MzScheme communicates with the world in UTF-8. A reasonable console
uses the current locale's encoding, which may not be UTF-8.

Probably it would be better for MzScheme to use the current locale's
encoding for stdin, stdout, and stderr when they're connected to a tty.
We haven't yet tried this, mostly due to lack of demand (relative to
lots of other things).

For now, putting something like

  (when (terminal-port? (current-output-port))
    ;; reencode-output-port is from (lib "port.ss"):
    (current-output-port (reencode-output-port (current-output-port) "")))

in your ".mzschemerc" usually works under Unix. There is an issue with
flushing output on exit, though, since MzScheme flushes only the
original ports before exiting.
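
One possible workaround for the flushing issue, as an untested sketch:
wrap `exit-handler' so that it flushes the re-encoding port before
exiting. The names `out' and `old-exit' are just local names for
illustration. This assumes that the exit path goes through the exit
handler, which may not hold for every way MzScheme can terminate.

  (require (lib "port.ss"))
  (when (terminal-port? (current-output-port))
    (let ([out (reencode-output-port (current-output-port) "")]
          [old-exit (exit-handler)])
      (current-output-port out)
      ;; Assumption: exiting invokes the exit handler. Flush the
      ;; re-encoding port, then chain to the original handler.
      (exit-handler (lambda (v)
                      (flush-output out)
                      (old-exit v)))))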


In the case of files, the right answer is less clear to me. Using UTF-8
everywhere means that we avoid all sorts of problems where a file works
on one machine and not on another. But many programs and libraries use
the current locale's encoding by default for files.


> Unfortunately, I can't translate this to Windows.  It wouldn't surprise me
> to learn that cmd.exe, or the graphical window that sits on top of that,
> can't handle Unicode I/O.  But I don't know how to get a terminal that
> does.

In Windows, there is a notion of a current code page, which is
essentially the same as having a locale with an implied encoding.
Currently, however, MzScheme always pretends that the default locale's
encoding is UTF-8 under Windows. So, the above re-encode operation
doesn't help: it re-encodes UTF-8 to UTF-8.

For now, you can force MzScheme to use a particular encoding under
Windows by setting `current-locale':

  (current-locale "en_US.CP437") ; CP 437 is the OEM-US console code page, not Latin-1

That is, on my Windows machine,

  (current-locale "en_US.CP437")
  (require (lib "port.ss"))
  (current-output-port (reencode-output-port (current-output-port) ""))
  #\uE1

shows #\á instead of #\<garbage> for the last result.

Matthew



Posted on the users mailing list.