[plt-scheme] confusing behavior with "reencode-input-port"

From: Thomas Chust (chust at web.de)
Date: Tue Jan 6 16:46:11 EST 2009

Neil Van Dyke wrote:
> I'm confused about a behavior of "reencode-input-port".
> 
> If the input is a port created with "open-input-bytes", then it works as
> expected.
> 
> If the input port is created with "open-input-string", however, then
> "reencode-input-port" has an effect that looks like the input is being
> *doubly* reencoded.
> [...]

Hello,

A string in PLT Scheme doesn't carry any information about the encoding
of the data from which it was created (and that's a good thing). When
you open a string input port, the bytes you read from it are always in
the same encoding, which happens to be UTF-8 by default.

Therefore the behaviour you see is exactly what I would have expected:
You tell the system to convert some character data into a UTF-8 stream
by using open-input-string, but then you tell the system to "convert"
that stream from ISO-Latin-1 encoding to UTF-8 with reencode-input-port.
So you get a second UTF-8 stream, but you can't expect its contents to
still have the same meaning as the original data. Conceptually, you
perform an unchecked cast from UTF-8 data via raw bytes to ISO-Latin-1
data and then turn the result into UTF-8 again with a "correct" type
conversion. The problem is that the data now typed as ISO-Latin-1 is
not really in that format.
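
To make the "unchecked cast" concrete, here is a minimal sketch of the
same byte-level effect. It is written in Python rather than PLT Scheme
purely for illustration, and the function name is hypothetical; it
models what happens to the bytes, not the PLT port API itself.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Model the double encoding: UTF-8 bytes wrongly decoded as ISO-Latin-1."""
    # Step 1: what the string port hands out, the UTF-8 bytes of the text.
    utf8_bytes = text.encode("utf-8")
    # Step 2: the "unchecked cast", treating those bytes as ISO-Latin-1.
    # Every byte maps to exactly one character, so this never fails.
    return utf8_bytes.decode("iso-8859-1")


if __name__ == "__main__":
    garbled = misread_utf8_as_latin1("é")
    # The single character became two, and re-encoding them as UTF-8
    # yields four bytes instead of the original two.
    print(len(garbled), garbled.encode("utf-8"))
```

Pure ASCII text survives this unchanged, since its UTF-8 and
ISO-Latin-1 representations coincide; only characters outside ASCII
are garbled.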

If what you wanted to do was to create an input port from which you
could read (as binary data) the representation of your initial string in
ISO-Latin-1 encoding, the best way I can think of is to create a pipe,
use reencode-output-port with ISO-Latin-1 encoding on the sink end of
the pipe, dump the string into it and read the result from the source
end of the pipe. At the moment I don't see how that could be useful, though.
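
For what it's worth, the pipe idea can be sketched as follows. This is
again Python standing in for PLT Scheme, so it is only an analogy: the
text wrapper plays the role of reencode-output-port on the sink end,
and the function name is hypothetical.

```python
import io
import os


def latin1_bytes_via_pipe(text: str) -> bytes:
    """Push text through an encoding wrapper on a pipe's sink end and
    read the raw ISO-Latin-1 bytes from the source end."""
    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd, "rb") as source:
        # The wrapper encodes characters as ISO-Latin-1 on their way
        # into the pipe. Closing it flushes the data and closes the
        # sink end, so the reader below sees end-of-file.
        sink = io.TextIOWrapper(os.fdopen(write_fd, "wb"),
                                encoding="iso-8859-1")
        with sink:
            # Assumption: the text is small enough to fit into the
            # pipe buffer; a larger text would need a separate writer
            # thread to avoid blocking here.
            sink.write(text)
        return source.read()


if __name__ == "__main__":
    print(latin1_bytes_via_pipe("é"))
```

Characters that have no ISO-Latin-1 representation make the write
fail with an encoding error in this sketch.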

cu,
Thomas



Posted on the users mailing list.