[racket-dev] URL escaping: question for web experts

From: Greg Hendershott (greghendershott at gmail.com)
Date: Mon Dec 17 12:04:30 EST 2012

p.s. Also the current docs[1] say this in the second paragraph:

    The URI encoding uses allows a few characters to be
    represented as-is: a through z, A through Z, 0-9, -,
     _, ., !, ~, *, ', ( and ).

But this in the final sentence:

    In additon, since there appear to be some brain-dead
    decoders on the web, the library also encodes !, ~,
    ', (, and ) using their hex representation, which is
    the same choice as made by the Java’s URLEncoder.

Which seems to be contradictory with respect to !, ~, ', ( and ).
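
To make the discrepancy concrete, here is a minimal sketch in Python (not
Racket, and not net/uri-codec's actual implementation). The helper
`percent_encode` and the two set names are hypothetical, written only for
illustration; the character sets are taken from the two doc passages
quoted above.

```python
import string

# Letters and digits are left as-is under both readings of the docs.
ALNUM = string.ascii_letters + string.digits

# Second paragraph of the docs: these may be represented as-is.
FIRST_PARAGRAPH_SAFE = set(ALNUM + "-_.!~*'()")

# Final sentence of the docs: these are hex-encoded after all.
FINAL_SENTENCE_ENCODED = set("!~'()")


def percent_encode(text, safe):
    """Percent-encode every UTF-8 byte of text whose character is not in safe.

    Hypothetical helper for illustration only.
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in safe:
            out.append(ch)
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


sample = "a!b~c*d'e(f)g"

# Reading 1: follow the second paragraph only.
lenient = percent_encode(sample, FIRST_PARAGRAPH_SAFE)

# Reading 2: apply the final sentence on top of it.
strict = percent_encode(sample, FIRST_PARAGRAPH_SAFE - FINAL_SENTENCE_ENCODED)

print(lenient)  # a!b~c*d'e(f)g
print(strict)   # a%21b%7Ec*d%27e%28f%29g
```

Under the second reading only * survives unencoded among the punctuation
in the sample, so the two passages cannot both describe the same encoder.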

[1]: http://docs.racket-lang.org/net/uri-codec.html

On Mon, Dec 17, 2012 at 11:55 AM, Greg Hendershott
<greghendershott at gmail.com> wrote:
> Although I'm hardly a web "expert", I think net/uri-codec is currently
> a little confusing.
>
> I get the impression that it was originally written prior to 2005,
> because the detailed introduction talks only about RFCs 1738 and
> 2396.[1]
>
> It looks like perhaps functions such as uri-path-segment-encode were
> added at a later date, to support RFC 3986. Although these functions'
> docs tersely link to RFC 3986, the overall net/uri-codec introduction
> wasn't revised accordingly, nor is there a simple explanation like
> "these also encode #\( #\) ...".  (As a result, I actually ended up
> writing my own variation because I overlooked them.)
>
> Aside from the history of the documentation and organization, another
> point is the treatment of +, which the docs say intentionally doesn't
> follow RFC 2396, but don't really explain why.  (One of my earliest
> experiments with Racket was a simple web crawler, and this #\+ <->
> #\space translation caused difficulties (although it's possible I was
> confused in other ways).)
>
>
> Wikipedia (usual caveats apply) says RFC 3986 has been the current
> standard since 2005.[2]
>
> I almost wonder if there should be a brand-new module that implements
> RFC 3986 strictly (either just that, or with any options/parameters
> defaulting to 3986), with the current net/uri-codec deprecated but
> preserved for backward compatibility.
>
> I wonder if that would be best because the functions and documentation
> may already be confusing. And this is a topic where it's easy for
> people to get confused to begin with and choose the wrong function.
>
>
> [1]: http://docs.racket-lang.org/net/uri-codec.html
>
> [2]: http://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_in_a_URI
>
> On Mon, Dec 17, 2012 at 9:59 AM, Eli Barzilay <eli at barzilay.org> wrote:
>> For many people there is a constant source of annoyance when you
>> copy+paste doc URLs into a markdown context as with stackoverflow and
>> others.  The problem is that these URLs have parens in them and at
>> least in Chrome, the copied URL still has them -- and because markdown
>> texts use parens for URLs "[text](url)" they get confused which means
>> that you have to manually replace parens with %28 and %29.
>>
>> Danny submitted a pull request that eventually got changed by Matthew
>> into a new parameter that controls which characters get encoded by
>> `net/uri-codec', so it can escape these too.  The result on Chrome is
>> that the copied URL has the escapes instead of parens, and clicking
>> such a URL makes the copy-able address have the escapes too.  The
>> actual page that is displayed is still the same one, of course; it's
>> just weird that Chrome has a certain context where the original URL
>> string is preserved as is.  (It even considered the escaped URL as one
>> that I didn't visit, even though I visited the one with the unescaped
>> parens.)
>>
>> In any case, given all of this I thought that maybe the default mode
>> could do the extra escaping -- it seems to me that there is no damage
>> with doing that, since in theory every character could be escaped
>> anyway.  There's a minor overhead of a few extra characters, but
>> there's the above benefit of doing it (which might be a temporary
>> thing for all I know).
>>
>> Neither Matthew nor I feel confident enough to have this encoding be
>> the default without consulting some potential web standard gurus.
>>
>> So?
>>
>> --
>>           ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
>>                     http://barzilay.org/                   Maze is Life!
>> _________________________
>>   Racket Developers list:
>>   http://lists.racket-lang.org/dev


Posted on the dev mailing list.