[racket] regexp operations on character input ports returning bytes
At Sat, 25 Dec 2010 10:23:54 -0500, Neil Van Dyke wrote:
> When doing a regexp on a character input port, what's the best way to
> get string results out instead of bytes results?
Decode the results of `regexp-match' using `bytes->string/utf-8'.
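
A minimal sketch of that decoding step. The helper name `regexp-match/strings' is made up for illustration and is not part of Racket; it only wraps `regexp-match' and passes #f through for unmatched groups:

```racket
#lang racket/base

;; Hypothetical helper (not a Racket built-in): match on a port, then
;; decode every byte-string result as UTF-8. Unmatched groups stay #f.
(define (regexp-match/strings pattern in)
  (define m (regexp-match pattern in))
  (and m
       (for/list ([b (in-list m)])
         (and b (bytes->string/utf-8 b)))))

;; A character input port; the match itself is reported as bytes.
(define in (open-input-string "naïve café"))
(define result (regexp-match/strings #rx"café" in))
```

Here `result' should be a list of strings rather than byte strings.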
> For example, this is documented behavior, but not actually what I want,
> because I don't want to have to re-encode the bytes as a string (plus, I
> would have to query the input port to find out what its character
> encoding is, if I don't know it a priori):
A string regexp on an input port is matched against the port's bytes
as a UTF-8 encoding by definition, so you can always decode the
results as UTF-8.
If some layer of the input has a different encoding, it's handled by
conversion to UTF-8 at the port level.
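
As a rough illustration of port-level conversion, here is a sketch using `reencode-input-port' from `racket/port'. It assumes the platform's converter (iconv) recognizes the encoding name "ISO-8859-1"; the sample bytes are invented for the example:

```racket
#lang racket/base
(require racket/port)

;; "café" encoded as Latin-1: the é is the single byte #xE9.
(define latin-1-in (open-input-bytes (bytes 99 97 102 #xE9)))

;; Wrap the port so that reads from it deliver UTF-8 bytes instead.
;; Assumption: the encoding name is supported on this platform.
(define utf-8-in (reencode-input-port latin-1-in "ISO-8859-1"))

;; The regexp now sees UTF-8, so the result can be decoded as UTF-8.
(define m (regexp-match #rx"café" utf-8-in))
(define decoded (and m (bytes->string/utf-8 (car m))))
```

With the conversion in place, the code doing the match never needs to know the original encoding.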
> do "regexp-match-peek-positions" as a peek and then use "read-string"
That doesn't work, because the positions are byte offsets, and they
don't tell you how many characters to read with `read-string'.
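
A small sketch of the mismatch, using an invented input where a two-byte character precedes the match. It also shows that reading by bytes and decoding afterwards lines up with the reported positions:

```racket
#lang racket/base

(define text "é=1")
(define in (open-input-string text))

;; Peeking leaves the port untouched; the positions are byte offsets.
(define pos (regexp-match-peek-positions #rx"=" in))

;; "é" occupies two bytes in UTF-8, so "=" ends at byte 3 but at
;; character 2. Passing 3 to read-string would read one char too many.
(define end-byte (cdar pos))

;; Reading by bytes and then decoding consumes exactly through the match.
(define consumed (bytes->string/utf-8 (read-bytes end-byte in)))
```
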
> Is there a better way using regexp operations on input ports?
No. Decoding bytes to a string using UTF-8 has to happen at some level,
so there are not really any efficiency or generality issues in
performing the decoding on the result of `regexp-match'.