[racket] uri-decode and non-UTF-8 percent-encoded links
Hello,
I'm running a (simple) web scraper on a page written in iso-8859-1
(declared in source using a meta tag).
The page contains links like this:
"http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4"
At one point the code calls:
(combine-url/relative current-url resource)
Where current-url is a Racket URL and resource is the aforementioned string.
I then get the error:
bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
#"Acad\352micos"
This seems to be a problem with uri-decode.
(uri-decode resource)
bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
#"http://www.ufrj.br/editais.php?tp=Acad\352micos&no=Cursos&idtp=4"
I looked at the source code of uri-decode to see that after decoding the
percent encoded string, a call to bytes->string/utf-8 expects the string to
be UTF-8 encoded... but there's no way to tell uri-decode to use a different
encoding.
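
The failure can be reproduced without any URL handling. A minimal sketch, assuming the byte string below is what is left after the percent-decoding step (the names raw, utf-8-ok? and as-latin-1 are only illustrative):

```racket
#lang racket/base
;; The bytes left after percent-decoding "Acad%EAmicos" (\352 octal = #xEA).
(define raw #"Acad\352micos")

;; Strict UTF-8 decoding fails: #xEA announces a three-byte sequence, but the
;; bytes that follow are plain ASCII rather than continuation bytes.
(define utf-8-ok?
  (with-handlers ([exn:fail:contract? (lambda (e) #f)])
    (bytes->string/utf-8 raw)
    #t))

;; Latin-1 maps each byte to the code point with the same value,
;; so this conversion cannot fail.
(define as-latin-1 (bytes->string/latin-1 raw))
```

Here utf-8-ok? ends up #f, while as-latin-1 holds the intended word with the accented e.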
I copied the relevant portion of code from uri-codec-unit.rkt in
collects/net, and verified that replacing bytes->string/utf-8 with
bytes->string/latin-1 makes it work... but that's like cheating :)
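
The same effect might be had without copying the library source. A minimal sketch, assuming every %XX escape in the link stands for exactly one Latin-1 byte; percent-decode/latin-1 is a hypothetical helper name, not something net/uri-codec provides:

```racket
#lang racket/base
;; Hypothetical helper: percent-decode STR, treating each %XX escape as one
;; Latin-1 byte. Latin-1 byte values coincide with Unicode code points
;; 0-255, so integer->char yields the right character directly.
(define (percent-decode/latin-1 str)
  (regexp-replace* #rx"%([0-9a-fA-F][0-9a-fA-F])"
                   str
                   (lambda (all hex)
                     (string (integer->char (string->number hex 16))))))

;; Example with the query-string fragment from the link in question.
(define decoded (percent-decode/latin-1 "tp=Acad%EAmicos&no=Cursos&idtp=4"))
```

This only covers a direct decoding call; combine-url/relative would still decode as UTF-8 internally, so it does not by itself remove the original error.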
AFAICT Chrome and Firefox handle the URL
"http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4"
as well as its UTF-8 %-encoded equivalent
"http://www.ufrj.br/editais.php?tp=Acad%C3%AAmicos&no=Cursos&idtp=4",
with the difference that the second is displayed as
"http://www.ufrj.br/editais.php?tp=Acadêmicos&no=Cursos&idtp=4"
(but when copied and pasted it is still %C3%AA instead of ê).
How could I make uri-decode understand an encoding other than UTF-8?
Thanks,
Rodolfo Carvalho