<div>Hello,</div><div><br></div><div>I'm running a (simple) web scraper on a page written in iso-8859-1 (declared in the source using a meta tag).</div><div>The page contains links like this:</div><div><br></div><div>"<a href="http://www.ufrj.br/editais.php?tp=Acad" target="_blank">http://www.ufrj.br/editais.php?tp=Acad</a><b>%EA</b>micos&no=Cursos&idtp=4"</div>
<div><br></div><div>At one point the code calls:</div><div>(combine-url/relative current-url resource)</div><div><br></div><div>where current-url is a Racket URL and resource is the aforementioned string.</div><div>I then get the error:</div>
<div><br></div><div>bytes->string/utf-8: string is not a well-formed UTF-8 encoding: #"Acad\352micos"</div><div><br></div><div><br></div><div>This seems to be a problem with uri-decode.</div><div><br></div><div>
(uri-decode resource)</div><div><br></div><div>bytes->string/utf-8: string is not a well-formed UTF-8 encoding: #"http://www.ufrj.br/editais.php?tp=Acad\352micos&no=Cursos&idtp=4"</div>
<div><br></div><div><br></div><div>I looked at the source code of uri-decode to see that after decoding the percent encoded string, a call to bytes->string/utf-8 expects the string to be UTF-8 encoded... but there's no way to tell uri-decode to use a different encoding.</div>
<div><br></div><div>I copied the relevant portion of uri-codec-unit.rkt from collects/net and verified that replacing bytes->string/utf-8 with bytes->string/latin-1 makes it work... but that's like cheating :)</div>
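<div><br></div><div>To make that workaround concrete, here is a minimal sketch of decoding with a fallback encoding. It does not touch net/uri-codec; percent-decode->bytes and uri-decode/fallback are hypothetical helper names introduced only for illustration, and the sketch assumes that any input which is not valid UTF-8 is Latin-1.</div><div><br></div>

```racket
#lang racket/base

;; Hypothetical helper (not part of net/uri-codec): turn each %XX escape
;; into the raw byte it names, without assuming any character encoding.
(define (percent-decode->bytes str)
  (regexp-replace* #rx#"%([0-9a-fA-F][0-9a-fA-F])"
                   (string->bytes/utf-8 str)
                   (lambda (whole hex)
                     (bytes (string->number (bytes->string/latin-1 hex) 16)))))

;; Hypothetical helper: decode the raw bytes as UTF-8 when they form valid
;; UTF-8, and fall back to Latin-1 (ISO-8859-1) otherwise.
;; bytes-utf-8-length returns #f for byte strings that are not valid UTF-8.
(define (uri-decode/fallback str)
  (define raw (percent-decode->bytes str))
  (if (bytes-utf-8-length raw)
      (bytes->string/utf-8 raw)
      (bytes->string/latin-1 raw)))

(uri-decode/fallback "tp=Acad%EAmicos")     ; Latin-1 escape
(uri-decode/fallback "tp=Acad%C3%AAmicos")  ; UTF-8 escape
```

<div><br></div><div>Both calls should yield the same string, with the escape decoded to ê. The fallback is only a heuristic: a byte sequence can be valid UTF-8 and still have been meant as Latin-1.</div>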
<div><br></div><div>AFAICT Chrome and Firefox handle the URL "<a href="http://www.ufrj.br/editais.php?tp=Acad" target="_blank">http://www.ufrj.br/editais.php?tp=Acad</a><b>%EA</b>micos&no=Cursos&idtp=4" as well as its UTF-8 %-encoded equivalent "<a href="http://www.ufrj.br/editais.php?tp=Acad" target="_blank">http://www.ufrj.br/editais.php?tp=Acad</a><b>%C3%AA</b>micos&no=Cursos&idtp=4", with the difference that the second is displayed as "<a href="http://www.ufrj.br/editais.php?tp=Acad%C3%AAmicos&no=Cursos&idtp=4" target="_blank">http://www.ufrj.br/editais.php?tp=Acadêmicos&no=Cursos&idtp=4</a>" (but when copied and pasted it is still <b>%C3%AA</b> instead of <b>ê</b>).</div>
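<div><br></div><div>The relation between the two forms can be sketched in code: each Latin-1 escape for a byte of #x80 or above is rewritten as the percent-encoded UTF-8 bytes of the same character. latin-1-escapes->utf-8-escapes is a hypothetical helper name, and the sketch assumes the input really is Latin-1 escaped (running it on an already UTF-8 escaped URL would double-encode it).</div><div><br></div>

```racket
#lang racket/base

;; Hypothetical helper: rewrite %XX escapes for bytes >= #x80, read as
;; Latin-1 code points, into percent-encoded UTF-8. ASCII escapes such as
;; %20 are left untouched because the first hex digit must be 8..F.
(define (latin-1-escapes->utf-8-escapes str)
  (regexp-replace* #rx"%([89a-fA-F][0-9a-fA-F])"
                   str
                   (lambda (whole hex)
                     ;; Latin-1 byte values coincide with Unicode code points.
                     (define ch (integer->char (string->number hex 16)))
                     (apply string-append
                            (for/list ([b (in-bytes (string->bytes/utf-8 (string ch)))])
                              (string-append "%" (string-upcase (number->string b 16))))))))

(latin-1-escapes->utf-8-escapes
 "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4")
```

<div><br></div><div>The result is the %C3%AA form of the URL, which is valid UTF-8 once decoded, so the idea would be to normalize scraped links this way before handing them to combine-url/relative. Whether the server accepts the UTF-8 form is a separate question that depends on the site.</div>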
<div><br></div><div><br></div><div>How could I make uri-decode understand an encoding other than UTF-8?</div><div><br></div><div><br></div><div>Thanks,</div><br clear="all">Rodolfo Carvalho<br>