[racket] uri-decode and non-UTF-8 percent-encoded links

From: Rodolfo Carvalho (rhcarvalho at gmail.com)
Date: Sat Sep 24 17:56:48 EDT 2011

Hello,

I'm running a (simple) web scraper on a page written in iso-8859-1
(declared in the source using a meta tag).
The page contains links like this:

"http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4"

At one point the code calls:
(combine-url/relative current-url resource)

Where current-url is a Racket URL and resource is the aforementioned string.
I then get the error:

bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
#"Acad\352micos"


This seems to be a problem with uri-decode.

(uri-decode resource)

bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
#"http://www.ufrj.br/editais.php?tp=Acad\352micos&no=Cursos&idtp=4"


I looked at the source code of uri-decode and saw that, after decoding the
percent-encoded string, it calls bytes->string/utf-8, which expects the
bytes to be UTF-8 encoded... but there's no way to tell uri-decode to use a
different encoding.

I copied the relevant portion of code from uri-codec-unit.rkt in
collects/net, and verified that I can replace bytes->string/utf-8 with
bytes->string/latin-1 and get it to work... but that's like cheating :)
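
A minimal sketch of that workaround, written from scratch rather than copied
from net/uri-codec. The names percent-decode->bytes, uri-decode/latin-1 and
uri-decode/fallback are hypothetical helpers, not part of any Racket library,
and the sketch assumes the input string is plain ASCII apart from its
%-escapes:

```racket
#lang racket/base

;; Hypothetical helper: turn every %XX escape into the raw byte it names.
;; Assumes the input is ASCII apart from the escapes themselves.
(define (percent-decode->bytes s)
  (regexp-replace* #rx#"%([0-9a-fA-F][0-9a-fA-F])"
                   (string->bytes/utf-8 s)
                   (lambda (all hex)
                     (bytes (string->number (bytes->string/latin-1 hex) 16)))))

;; Hypothetical helper: always interpret the decoded bytes as Latin-1.
(define (uri-decode/latin-1 s)
  (bytes->string/latin-1 (percent-decode->bytes s)))

;; Hypothetical helper: try UTF-8 first, fall back to Latin-1.
;; bytes-utf-8-length returns #f when the bytes are not well-formed UTF-8.
(define (uri-decode/fallback s)
  (define bs (percent-decode->bytes s))
  (if (bytes-utf-8-length bs)
      (bytes->string/utf-8 bs)
      (bytes->string/latin-1 bs)))
```

With these definitions, (uri-decode/latin-1 "tp=Acad%EAmicos") should give
"tp=Acadêmicos". The fallback variant is only a guess at the encoding: a byte
sequence that happens to be valid UTF-8 will be read as UTF-8 even if the page
meant Latin-1.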

AFAICT Chrome and Firefox handle the URL
"http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4" as well as
its UTF-8 percent-encoded equivalent
"http://www.ufrj.br/editais.php?tp=Acad%C3%AAmicos&no=Cursos&idtp=4", with
the difference that the second is displayed as
"http://www.ufrj.br/editais.php?tp=Acadêmicos&no=Cursos&idtp=4" (but when
copied and pasted it is still %C3%AA instead of ê).
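
To make the relation between the two forms concrete, here is a rough sketch
that rewrites Latin-1 %-escapes into their UTF-8 %-encoded equivalents. The
function names are hypothetical, and it assumes every escape of 0x80 or above
in the input really is a single Latin-1 character (a link that is already
UTF-8 encoded would get double-encoded):

```racket
#lang racket/base

;; Hypothetical helper: format one byte as an upper-case %XX escape.
(define (percent-encode-byte b)
  (string-append "%"
                 (if (< b 16) "0" "")
                 (string-upcase (number->string b 16))))

;; Hypothetical helper: re-encode each Latin-1 escape >= 0x80 as the
;; %-escaped UTF-8 bytes of the same character. ASCII escapes are kept.
;; Assumes the input is known to be Latin-1 percent-encoded.
(define (latin-1-escapes->utf-8-escapes s)
  (regexp-replace*
   #rx"%([0-9a-fA-F][0-9a-fA-F])"
   s
   (lambda (all hex)
     (define n (string->number hex 16))
     (if (< n 128)
         all
         (apply string-append
                (for/list ([b (in-bytes
                               (string->bytes/utf-8
                                (string (integer->char n))))])
                  (percent-encode-byte b)))))))
```

Applied to "tp=Acad%EAmicos&no=Cursos", this should produce
"tp=Acad%C3%AAmicos&no=Cursos". I would expect the rewritten string to get
through combine-url/relative without the UTF-8 error, but whether the server
accepts the UTF-8 form of the query is a separate question.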


How could I make uri-decode understand an encoding other than UTF-8?


Thanks,

Rodolfo Carvalho

Posted on the users mailing list.
