[racket] regexp operations on character input ports returning bytes

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sat Dec 25 10:43:20 EST 2010

At Sat, 25 Dec 2010 10:23:54 -0500, Neil Van Dyke wrote:
> When doing a regexp on a character input port, what's the best way to 
> get string results out instead of bytes results?

Decode the results of `regexp-match' using `bytes->string/utf-8'.
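
A minimal sketch of that decoding step, assuming a wrapper is wanted. The
name `regexp-match/strings' is hypothetical and not part of Racket; only
`regexp-match' and `bytes->string/utf-8' are library functions. A group
that did not participate in the match comes back as #f, so it is passed
through rather than decoded.

```racket
#lang racket/base

;; Hypothetical helper (not a Racket built-in): match on a port as usual,
;; then decode each byte-string result as UTF-8.
(define (regexp-match/strings rx in)
  (define m (regexp-match rx in))
  (and m
       (for/list ([b (in-list m)])
         ;; an unmatched group is #f; keep it as #f
         (and b (bytes->string/utf-8 b)))))

;; Example input containing a non-ASCII character (e with acute accent).
(define in (open-input-string "name: caf\u00E9; rest"))

;; The second group cannot match here, so its slot stays #f.
(define result (regexp-match/strings #rx"name: ([^;]*)(x)?" in))
```

Here `result' is a list whose elements are strings or #f, in the same
order as the byte strings that `regexp-match' would have returned.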

> For example, this is documented behavior, but not actually what I want, 
> because I don't want to have to re-encode the bytes as a string (plus, I 
> would have to query the input port to find out its character 
> encoding, if I don't know it a priori):

A string regexp applied to an input port matches against the port's
bytes interpreted as UTF-8, by definition, so you can always decode the
results as UTF-8.

If some layer of the input has a different encoding, it's handled by
conversion to a UTF-8 encoding at the port level.
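
As an illustration of port-level conversion, here is a sketch that uses
`reencode-input-port' from `racket/port'. It assumes the underlying bytes
are Latin-1 and that the platform's converter (typically iconv) recognizes
the encoding name "ISO-8859-1"; both are assumptions made for this example.

```racket
#lang racket/base
(require racket/port)

;; Latin-1 bytes for "cafe au lait" with an accented e; \351 is byte 233.
(define raw (open-input-bytes #"caf\351 au lait"))

;; Wrap the port so that reading from `in' yields UTF-8 bytes.
;; Assumption: "ISO-8859-1" is supported by the platform converter.
(define in (reencode-input-port raw "ISO-8859-1"))

;; The string regexp sees UTF-8, so `.' matches the whole accented letter.
(define m (regexp-match #rx"caf." in))

;; The result is UTF-8 bytes regardless of the original encoding.
(define word (and m (bytes->string/utf-8 (car m))))
```

The regexp code never needs to know the original encoding; only the
wrapper port does.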

> do "regexp-match-peek-positions" as a peek and then use "read-string" 

That doesn't work, because the positions are byte offsets, and a
character can occupy more than one byte in UTF-8, so you don't know how
many characters to read given the positions in bytes.
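
A small sketch of the mismatch, using an example string chosen here for
illustration: two lambdas (two bytes each in UTF-8) followed by "bcd".
Reading bytes up to the reported offset and then decoding gives the text
through the match, while passing the same number to `read-string' reads
too far.

```racket
#lang racket/base

;; Two 2-byte characters (lambda), then three 1-byte characters.
(define text "\u03BB\u03BBbcd")

;; Peek for "b"; the reported positions count bytes, not characters.
(define in (open-input-string text))
(define pos (regexp-match-peek-positions #rx"b" in))
(define end (cdr (car pos)))

;; Reading `end' bytes and decoding stops right after the match.
(define through-match (bytes->string/utf-8 (read-bytes end in)))

;; Treating the same number as a character count overshoots the match.
(define in2 (open-input-string text))
(define too-much (read-string end in2))
```

In this example `end' is a byte offset, so `through-match' ends exactly
at the "b", whereas `too-much' continues past it.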

> Is there a better way using regexp operations on input ports?

No. Decoding bytes to a string using UTF-8 has to happen at some level,
so there are not really any efficiency or generality issues in
performing the decoding on the result of `regexp-match'.



Posted on the users mailing list.