[racket] What encoding does racket/drracket use for bytestring literals?

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Thu Nov 13 10:12:43 EST 2014

Yes, that's strange behavior.

Just to be clear, the reader works at the layer of characters, which
means that the content of a #"..." literal is expressed in terms of
characters.

The intent is that characters in the Unicode range 0-255 represent the
corresponding byte value in the byte string --- a Latin-1 encoding, if
you prefer --- and other characters are not allowed.
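
As a rough sketch of that intended mapping (not the reader's actual code), the hypothetical helper `chars->bytes/intended` below converts a string to a byte string one character at a time and rejects anything above 255. Both the helper name and the error message are made up for illustration.

```racket
#lang racket/base

;; Hypothetical helper, for illustration only: map each character to
;; the byte with the same value, and reject characters above 255.
(define (chars->bytes/intended str)
  (list->bytes
   (for/list ([c (in-string str)])
     (define n (char->integer c))
     (unless (<= 0 n 255)
       (error 'chars->bytes/intended
              "character out of byte range: ~e" c))
     n)))

;; U+00E4, U+00F6, U+00E5 ("äöå") are all within 0-255, so this agrees
;; with string->bytes/latin-1.
(chars->bytes/intended "\u00E4\u00F6\u00E5") ; => #"\344\366\345"
```

Passing "\u20AC" (the euro sign) to this helper raises an exception, which is what the reader is meant to do for such a literal.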

The implementation, however, lost its check on the character value
somewhere along the way. As a result, it effectively `bitwise-and`s
each character's Unicode value with 255! That's especially confusing
and unhelpful. The reader should raise an exception.
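
To make the accidental behavior concrete, here is a minimal sketch that mimics it outside the reader. The helper `chars->bytes/masked` is hypothetical and only models the masking described above. It avoids writing a `#"..."` literal with an out-of-range character, since such a literal may be rejected once the check is restored.

```racket
#lang racket/base

;; Hypothetical helper, for illustration only: keep just the low 8 bits
;; of each character's Unicode value, as the buggy reader effectively does.
(define (chars->bytes/masked str)
  (list->bytes
   (for/list ([c (in-string str)])
     (bitwise-and (char->integer c) 255))))

;; The euro sign is U+20AC, and #x20AC masked with #xFF is #xAC,
;; which is octal 254.
(chars->bytes/masked "\u20AC") ; => #"\254"
```

In Latin-9 the euro sign is byte \244, not \254, which would explain why the last comparison in the quoted program reports that the two byte strings differ.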

So, for all of your examples except the last one, the reported result
is as intended, but the last one should have triggered a reader error.

At Thu, 13 Nov 2014 16:29:07 +0200, Tomi Pieviläinen wrote:
> #lang racket
> (bytes=? #"äöå"
>          (string->bytes/latin-1 "äöå"))
> 
> (bytes=? (string->bytes/utf-8 "äöå")
>          (string->bytes/locale "äöå"))
> (not (bytes=? #"€"
>               #"\244"))
> 
> gives me #t, #t and #t on both racket and drracket. So in other words
> latin-1 chars are interpreted as latin-1, but latin-9 is something
> else. And it definitely isn't using the system locale, which is UTF-8
> on my computer.
> 
> So how does racket decode bytestring literals?
> 
> -- 
> Tomi Pieviläinen, +358 400 487 504
> A: Because it disrupts the natural way of thinking.
> Q: Why is top posting frowned upon?
> 
> ____________________
>   Racket Users list:
>   http://lists.racket-lang.org/users


Posted on the users mailing list.