[racket] uri-decode and non-UTF-8 percent-encoded links

From: Jay McCarthy (jay.mccarthy at gmail.com)
Date: Mon Sep 26 12:57:11 EDT 2011

There is no way to do that, and I think this is a major flaw throughout a lot of the
Racket net libraries. I have gone through many efforts to use bytes
throughout the Web server to avoid issues like this, but the URL
module is one place where it hurts.

I think it should be written to use bytes internally and provide the
UTF-8 string versions for compatibility. Unfortunately, I don't have
the time to fix it now.
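
As a rough sketch of that idea (this is not the net/uri-codec
implementation; `percent-decode/bytes` and `percent-decode` are
hypothetical names, and only %XX escapes are handled), the decoder
could produce bytes first and leave the charset choice to the caller:

```racket
#lang racket/base
;; Hypothetical helpers for illustration; not part of net/uri-codec.

;; hex-value : byte -> (or/c exact-nonnegative-integer? #f)
;; Numeric value of an ASCII hex digit, or #f if the byte is not one.
(define (hex-value b)
  (cond
    [(<= 48 b 57) (- b 48)]    ; 0-9
    [(<= 65 b 70) (- b 55)]    ; A-F
    [(<= 97 b 102) (- b 87)]   ; a-f
    [else #f]))

;; percent-decode/bytes : string -> bytes
;; Turns each %XX escape into the raw byte XX without assuming a charset.
;; Malformed escapes (e.g. "%G1" or a trailing "%") are copied through as-is.
(define (percent-decode/bytes str)
  (define in (string->bytes/utf-8 str))
  (define len (bytes-length in))
  (define out (open-output-bytes))
  (let loop ([i 0])
    (when (< i len)
      (let* ([b (bytes-ref in i)]
             [hi (and (= b 37)             ; 37 is #\%
                      (< (+ i 2) len)
                      (hex-value (bytes-ref in (+ i 1))))]
             [lo (and hi (hex-value (bytes-ref in (+ i 2))))])
        (cond
          [lo (write-byte (+ (* 16 hi) lo) out)
              (loop (+ i 3))]
          [else (write-byte b out)
                (loop (+ i 1))]))))
  (get-output-bytes out))

;; percent-decode : string [(bytes -> string)] -> string
;; String version for compatibility; UTF-8 by default, but the caller
;; may pass another converter such as bytes->string/latin-1.
(define (percent-decode str [bytes->str bytes->string/utf-8])
  (bytes->str (percent-decode/bytes str)))

;; (percent-decode "Acad%EAmicos" bytes->string/latin-1)  ; => "Acadêmicos"
;; (percent-decode "Acad%C3%AAmicos")                     ; => "Acadêmicos"
```

With the default converter, "Acad%EAmicos" would still raise the
same bytes->string/utf-8 error, so the caller has to know (or guess)
the page's encoding.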

Jay

On Sat, Sep 24, 2011 at 3:56 PM, Rodolfo Carvalho <rhcarvalho at gmail.com> wrote:
> Hello,
> I'm running a (simple) web scraper on a page written in iso-8859-1
> (declared in source using a meta tag).
> The page contains links like this:
> "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4"
> At one point the code calls:
> (combine-url/relative current-url resource)
> Where current-url is a Racket URL and resource is the aforementioned string.
> I then get the error:
> bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
> #"Acad\352micos"
>
> This seems to be a problem with uri-decode.
> (uri-decode resource)
> bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
> #"http://www.ufrj.br/editais.php?tp=Acad\352micos&no=Cursos&idtp=4"
>
> I looked at the source code of uri-decode to see that after decoding the
> percent encoded string, a call to bytes->string/utf-8 expects the string to
> be UTF-8 encoded... but there's no way to tell uri-decode to use a different
> encoding.
> I copied the relevant portion of code from uri-codec-unit.rkt from the
> collects/net, and verified that I can change bytes->string/utf-8
> => bytes->string/latin-1 and get it to work... but that's like cheating :)
> AFAICT Chrome and Firefox handle the
> URL "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4" as
> well as its UTF-8 %-encoded
> equivalent "http://www.ufrj.br/editais.php?tp=Acad%C3%AAmicos&no=Cursos&idtp=4",
> with the difference that the second appears as
> "http://www.ufrj.br/editais.php?tp=Acadêmicos&no=Cursos&idtp=4" (but when
> copied->pasted is still %C3%AA instead of ê).
>
> How could I make uri-decode understand an encoding other than UTF-8?
>
> Thanks,
> Rodolfo Carvalho



-- 
Jay McCarthy <jay at cs.byu.edu>
Assistant Professor / Brigham Young University
http://faculty.cs.byu.edu/~jay

"The glory of God is Intelligence" - D&C 93



Posted on the users mailing list.