[racket-dev] url->string: what do we do?

From: Robby Findler (robby at eecs.northwestern.edu)
Date: Wed Mar 28 13:02:40 EDT 2012

Eli, I think your comments generally make sense here, but I don't see
how they help us reach a concrete decision about what to do with
string->url and the commit.

As things stand, my inclination would be to try to find a simpler
regexp that doesn't cost as much in the contract and leave that
checked in until you get to the changes that you want to do.

This seems to put us in the best position of letting Andy continue to
do testing and you continue to improve string->url.

Also, just to clarify, I don't think the goal here is to get a
completely precise spec for string->url, only to improve it enough
that random checking can be guided more effectively (currently we get
a lot of failures that aren't contract violations and those lead to
useless tests).
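
For illustration only, a cheaper contract of the kind mentioned above
might be sketched as follows.  The names `cheap-url-rx' and
`parse-url-sketch' are hypothetical, and this is not the regexp that is
checked in.  The pattern only rejects strings whose leading "x:" prefix
does not look like an RFC 3986 scheme; that this is the failure worth
catching is an assumption.

```racket
#lang racket/base
(require racket/contract)

;; Hypothetical cheap pattern (an assumption, not the checked-in regexp).
;; Either the string starts with something shaped like an RFC 3986
;; scheme followed by ":", or there is no ":" before the first "/", "?",
;; "#", or the end of the string.
(define cheap-url-rx
  #rx"^(?:[a-zA-Z][-a-zA-Z0-9+.]*:|[^:/?#]*(?:[/?#]|$))")

;; Stand-in for the real parser: it just returns its argument, so that
;; only the contract is being exercised here.
(define/contract (parse-url-sketch s)
  (-> (and/c string? cheap-url-rx) string?)
  s)

(parse-url-sketch "http://racket-lang.org/") ; passes the contract
;; (parse-url-sketch "1:/") is rejected by the contract instead of
;; reaching the parser.
```

A pattern like this checks only the front of the string, so it stays
much cheaper than running the full parse a second time.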

Robby

On Wed, Mar 28, 2012 at 8:34 AM, Eli Barzilay <eli at barzilay.org> wrote:
> Yesterday, Andy Gocke wrote:
>> [...]  This is especially relevant for functions like string->number
>> because the most obvious implementation checks validity during
>> parsing -- checking the validity and parsing basically duplicate the
>> function.
>
> And that makes most of my point.
>
> The thing is that `string->url' is basically *just* a parser -- it
> does very little after matching the regexp.  I therefore view the
> commit as adding a contract to, say, `read-xml', where the contract
> runs the function to see that the input is valid.
>
> An even more extreme example would be `get-pure-port': if you really
> want a complete specification of the domain in a contract, then the
> contract should make sure that the server is reachable, and that it
> returns a valid page.  Combine this with parsing the page, and it
> should be clear that this is not really a great way to run code (ATM!).
> Besides the issue of doing a bunch of work twice, the contract would
> still be broken since having a valid server and/or a page now doesn't
> mean that it's going to be valid on the next attempt.  To make this
> practical, you'd need some way to expose values that are computed
> as part of the contract.  ("Reify" feels wrong to me in this
> context...)
>
> That's why I added the above "ATM".  There is an obvious appeal in
> doing this -- having all error handling in specific pieces of code and
> "floating" them upwards sounds tempting *if* there's some way to do it
> right.  I suspect that such an exposure of the contract results is
> just one small step in getting this.  I'm also not sure that it's
> doable in a way that actually leads to a practical benefit.  This is
> similar to me doubting the theoretical utility in running a parser
> twice: on one level you get your guaranteed, nicely total function,
> but on the level of providing that guarantee, you get the original
> problem.  (And in terms that I'm used to, this is switching the same
> work to your well-formedness goal, and that buys nothing in terms of
> getting things done.)
>
> IMO, this problem is fundamental enough that it shows up in many
> contexts.  One of them is already visible in the `string->url'
> example.  The new documentation reads:
>
>  | url-regexp : regexp?
>  |
>  |   This is a regular expression based on the one in Appendix B of
>  |   RFC 3986 for recognizing urls.  This is the precise regexp:
>  |
>  |   ^(?:([^:/?#]*):)?(?://(?:([^/?#@]*)@)?([^/?#:]*)?(?::([0-9]*))?)
>  |   ?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$
>
> (Pre-disclaimer: the following is not said in a negative way.)
>
> At least in my view, this documentation is useless.  It's true that
> it's precise, but as a user of this code, I get nothing out of it.  I
> can't even *use* that regexp (the one quoted in the docs) since it
> looks like something that can easily change, so I'd better use the
> `url-regexp' binding and not the quoted regexp.
>
> But the deeper reason that this is not useful to me is that it
> essentially spells out the parser code -- and documenting a function
> using its own code is (IMO) often a sign that the abstraction is
> questionable.
>
>
> But there are a few additional problems with this change that I see:
>
> * Beyond quoting it in the documentation, exposing the regexp means
>  that it becomes part of the interface.  This means that I now cannot
>  re-implement the code in any way other than matching a regexp.
>
> * It is still partial.  For example, this
>
>    -> (string->url "1:/")
>    ; Invalid URL string; bad scheme "1": "1:/"
>
>  is still not a contract error.  (And I can't see an obvious way to
>  add it to the regexp, maybe with some lookahead tricks.)
>
>  Another example is the host part, which is not even checked, but
>  this is just sloppiness (= deferring it to network errors that will
>  happen with malformed hosts).  And BTW, doing that means that the
>  contract becomes platform dependent:
>
>    -> (file-url-path-convention-type 'unix)
>    -> (url-host (string->url "file://x:&x/baz"))
>    "x"
>    -> (file-url-path-convention-type 'windows)
>    -> (url-host (string->url "file://x:&x/baz"))
>    ""
>
> * More importantly, and possibly related to the first bullet, it
>  stands in the way of improving this code.  There is a major problem
>  in the design of the code -- it parses all urls as `http'.  A proper
>  way to deal with it is to choose a specific parser based on the
>  scheme.  For example, as it looks now, I can't change it to properly
>  treat "mailto:..." urls.
>
>  That's not theoretical -- I planned on doing that extension, and now
>  it is impossible to do it in a nice way.
>
> --
>          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
>                    http://barzilay.org/                   Maze is Life!
> _________________________
>  Racket Developers list:
>  http://lists.racket-lang.org/dev
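
The partiality point in the quoted message can be checked directly
against the documented regexp.  A minimal sketch, where `doc-url-rx'
and `scheme-group' are hypothetical names and the pattern is the one
quoted from the documentation, joined onto a single line:

```racket
#lang racket/base

;; The regexp quoted in the documentation excerpt, joined onto one line.
(define doc-url-rx
  (regexp
   (string-append
    "^(?:([^:/?#]*):)?(?://(?:([^/?#@]*)@)?([^/?#:]*)?(?::([0-9]*))?)?"
    "([^?#]*)(?:\\?([^#]*))?(?:#(.*))?$")))

;; Hypothetical helper: the scheme capture group of a match, or #f if
;; the string does not match or has no scheme part.
(define (scheme-group s)
  (define m (regexp-match doc-url-rx s))
  (and m (cadr m)))

(scheme-group "1:/")
(scheme-group "http://racket-lang.org/")
```

The first call returns "1": the string matches the regexp, so a
contract built on it accepts "1:/" and the bad scheme is only reported
later by the parser, as in the quoted transcript.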


Posted on the dev mailing list.