[racket] windows-1252 charset decoding

From: John Clements (johnbclements at gmail.com)
Date: Wed Mar 4 15:04:15 EST 2015

I see that the documentation suggests that (entity-charset) is supposed to
return a symbol. However, it nearly always returns a string. In particular,
it appears to me that it returns a symbol only when it returns its default,
'us-ascii.

I feel compelled to repair this, but there are two ways to fix it:
1) make it match the docs and always return a symbol, or
2) change the docs and the default to return a string.

It looks to me like #2 will break less code, though it's certainly
possible that people depend on the default value's being a symbol.

Opinions? In my tree, I've added contract checks on the structure exports
and changed the documentation and default to always return a string. If
people like this, I can just submit it as a pull request.
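
Until one of these fixes lands, calling code can be written to tolerate both
return types. A minimal sketch, assuming a hypothetical helper named
`charset->string` (it is not part of net/mime) that normalizes whatever
`entity-charset` hands back to a lowercase string:

```racket
#lang racket/base

;; Hypothetical helper, not part of net/mime: accept a charset value
;; that is either a symbol (e.g. the default 'us-ascii) or a string,
;; and return it as a lowercase string.
(define (charset->string cs)
  (string-downcase
   (if (symbol? cs)
       (symbol->string cs)
       cs)))

;; Both kinds of value normalize to a plain string.
(charset->string 'us-ascii)
(charset->string "Windows-1252")
```

Lowercasing is a design choice of this sketch, made so that later
comparisons do not depend on how the sender's mailer capitalized the name.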

John


On Tue, Mar 3, 2015 at 10:11 PM, John Clements <clements at brinckerhoff.org>
wrote:

>
> On Mar 3, 2015, at 4:31 PM, Matthew Flatt <mflatt at cs.utah.edu> wrote:
>
> > You can use "windows-1252" as an encoding name with, for example,
> > `reencode-input-port`:
> >
> >> (read-line (reencode-input-port (open-input-bytes #"\xA3")
> >                                   "windows-1252"))
> > "£"
>
> Perfect!
>
> I went looking for a place where I might add a “windows-1252” search term,
> but it looks like it might be hard, since the list of supported encodings
> is apparently platform dependent. Would it make sense simply to attach a
> free-floating search tag of “windows-1252” to this part of the
> documentation?
>
> >
> > For handling e-mail, see also `generalize-encoding` from `net/unihead`.
>
> That probably saved me another half-hour of searching and head-scratching.
>
> Thanks!
>
> John
>
> (p.s.: no one whose mailer checks DMARC records will get this e-mail,
> sadly. Can’t wait to change to Google Groups.)
>
> >
> > At Tue, 3 Mar 2015 16:22:26 -0800, John Clements wrote:
> >> I'm trying to process a bunch of e-mail, and I've discovered that lots of
> >> it is encoded using the "windows-1252" charset.  It looks pretty
> >> straightforward to map this to unicode, but I thought I'd check: has anyone
> >> written this code already?
> >>
> >> John Clements
> >> ____________________
> >>  Racket Users list:
> >>  http://lists.racket-lang.org/users
>
>

Posted on the users mailing list.