[plt-scheme] efficiently converting from foo to Latin-1-safe for storage in strings

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sun Jan 4 18:54:58 EST 2009

At Sun, 04 Jan 2009 11:25:03 -0500, Neil Van Dyke wrote:
> Let's say that most of my system is using UTF-8, but there is one part 
> that is not yet.
> 
> Until that part can be reworked, I want to make sure that all 
> user-supplied strings (which are read from a port in character encoding 
> "foo") have been converted to Latin-1, with any non-Latin-1 characters 
> replaced with question-marks.
> 
> These converted Latin-1-safe string values are stored in Scheme strings.
> 
> What's an efficient way to do this?  Plug multiple "reencode-input-port" 
> together, to convert from "foo" to Latin-1 to UTF-8?

If you control the reading of strings from the port, then I recommend
composing `string->bytes/latin-1' (using `(char->integer #\?)' as the
second argument) and `bytes->string/latin-1' to filter the strings.
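
A minimal sketch of that composition, assuming a module written in
`#lang scheme/base'. The name `latin-1-safe' is made up for
illustration; the two conversion functions are the ones named above.

```scheme
#lang scheme/base

;; Hypothetical helper (not from the original message): returns a copy
;; of `str' in which every character outside Latin-1 (code point above
;; 255) is replaced by #\?.
(define (latin-1-safe str)
  (bytes->string/latin-1
   ;; The second argument is the byte to substitute for characters
   ;; that have no Latin-1 encoding; here, the byte for #\?.
   (string->bytes/latin-1 str (char->integer #\?))))

;; U+00E9 (e with acute accent) is in Latin-1 and is kept;
;; U+03BB (Greek lambda) is not, so it becomes #\?.
(latin-1-safe "caf\u00E9 \u03BB") ; => "caf\u00E9 ?"
```

The result is still an ordinary Scheme string, so it can be stored and
passed around like any other; the round trip through bytes only serves
to do the replacement.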

If you need a port whose stream contains only Latin-1 characters, then
it's more complicated, and I doubt that you can do it with
`reencode-input-port'. The solution I see is to create a pipe and a
background process that reads from the original and writes filtered
bytes/characters into the pipe, so that you use the read end of the
pipe in place of the original port. (The filter should always read
characters, but after filtering, it should write either characters or
bytes back into the pipe, depending on whether the Latin-1 content is
to be read with functions like `read-string' or like `read-bytes'.)
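
The pipe approach might look roughly like the sketch below. It assumes
`#lang scheme/base', uses a thread as the background process, and
assumes that the port passed in already delivers characters (that is,
any decoding from "foo" has been set up separately). The name
`make-latin-1-safe-port' and its `as-bytes?' argument are made up for
illustration.

```scheme
#lang scheme/base

;; Hypothetical helper (not from the original message): returns an
;; input port to use in place of `in'. A background thread reads
;; characters from `in', replaces non-Latin-1 characters with #\?,
;; and writes the result into a pipe.
;;
;; When `as-bytes?' is true, each character is written as its single
;; Latin-1 byte, for consumers that use `read-bytes'. Otherwise the
;; characters themselves are written, for consumers that use
;; `read-string' or `read-char'.
(define (make-latin-1-safe-port in as-bytes?)
  ;; The limit of 4096 bytes is an arbitrary choice; it keeps the
  ;; filter from running far ahead of the reader.
  (define-values (pipe-in pipe-out) (make-pipe 4096))
  (thread
   (lambda ()
     (let loop ()
       (let ([c (read-char in)])
         (unless (eof-object? c)
           (let ([safe (if (< (char->integer c) 256) c #\?)])
             (if as-bytes?
                 (write-byte (char->integer safe) pipe-out)
                 (write-char safe pipe-out)))
           (loop))))
     ;; Closing the write end lets the reader see end-of-file.
     (close-output-port pipe-out)))
  pipe-in)

;; Example use with a string port standing in for the original port.
(define safe-in
  (make-latin-1-safe-port (open-input-string "a\u03BBb") #f))
(read-string 10 safe-in) ; => "a?b"
```

Reading one character at a time keeps the sketch short and avoids
stalling on a slow source, at some cost in speed. The sketch also does
no error handling: if reading from `in' raises an exception, the write
end of the pipe is never closed.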


Matthew



Posted on the users mailing list.