[racket] Some design "whys" of regexps in Racket

From: Eli Barzilay (eli at barzilay.org)
Date: Fri Jun 3 23:40:37 EDT 2011

30 minutes ago, Rodolfo Carvalho wrote:
> Eli says that
> 
> > (BTW, Racket's solution is something that is done in many other
> > languages too.)
> 
> I come from Python where I can write
> 
> >>> re.findall("\d{2}", "06/03/2011")
> ['06', '03', '20', '11']
> 
> And printing the string that I used for my regexp gives:
> 
> >>> print "\d{2}"
> \d{2}

At a high level, python *is* doing the same -- it uses strings to
specify regexps, so both have the *same* syntax.  So this:

> That is writing strings is not exactly the same as writing "strings
> for a regexp".

is wrong -- it's the same syntax for both.  (At least AFAICT.)
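
A minimal sketch of that point, written in Python 3 syntax (the
session quoted above is Python 2): the pattern handed to re.findall is
an ordinary str, and the raw-string prefix is only another spelling of
the same value.  The variable names are illustrative.

```python
import re

# Two spellings of the same string value.
pattern_escaped = "\\d{2}"   # backslash written as an escape sequence
pattern_raw = r"\d{2}"       # raw literal, backslash taken as-is

assert pattern_escaped == pattern_raw
assert isinstance(pattern_raw, str)   # no separate "regexp string" type

matches = re.findall(pattern_raw, "06/03/2011")
print(matches)  # ['06', '03', '20', '11']
```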


But -- it happens that python chose to go with the *extremely* messy
(IMO) attempt at making backslashes "easy" to write.  It does that by
deciding that some backslash-chars are special escape sequences, but
for other uses of backslashes, you get a plain backslash where racket
would throw a reading error. So you get this in python:

  >>> print "\a"
  ^G                   <-- escape
  >>> print "\d"
  \d                   <-- no escape
  >>> print " \ "
   \                   <-- no escape
  >>> print " \\ "
   \                   <-- escape, resulting in the same thing
  >>> print "\"
  [SyntaxError: ...]   <-- surprise
  >>> ("\\" + "y") == "\y"
  True                 <-- another surprise

(I've had some more examples that were almost identical to what
Matthew posted...  I just used "a\a", and you can imagine the
disasters.)
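
The same cases can be checked without a prompt.  This is a sketch in
Python 3 that assumes the reader rules are unchanged from the session
above; read_literal is a hypothetical helper that evaluates the source
text of one string literal, and it silences the warning that newer
Python versions give for unrecognized escapes such as \d.

```python
import ast
import warnings


def read_literal(source):
    """Evaluate the source text of one string literal (hypothetical helper)."""
    with warnings.catch_warnings():
        # Newer Python versions warn about unrecognized escapes such as \d.
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


bell = read_literal(r'"\a"')       # recognized escape: one BEL character
digit = read_literal(r'"\d"')      # unrecognized escape: backslash is kept
doubled = read_literal(r'"\\d"')   # escaped backslash followed by d
trap = read_literal(r'"a\a"')      # looks like a\a, is "a" plus BEL

assert len(bell) == 1 and bell == chr(7)
assert len(digit) == 2
assert digit == doubled
assert trap == "a" + chr(7)

try:
    read_literal(r'"\"')           # the backslash escapes the closing quote
except SyntaxError:
    lone_backslash_ok = False
else:
    lone_backslash_ok = True

print(lone_backslash_ok)  # False
```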

Then, to make things "better", you get a printed form for strings,
which always backslash-escapes the backslashes:

  >>> "\d"
  '\\d'

but of course you also get a second set of quotes (making things "even
more convenient"), and the printout can't see what the input
expression was, so it gets printed with single quotes.

Here's an interesting experiment to do -- find a complete newbie, and
explain that in python when you type values you see them printed back:

  >>> 1
  1

Now see how much work you'll need to invest to explain that gem:

  >>> "\d"
  '\\d'

or, even better:

  >>> """\d"""
  '\\d'
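
For what it's worth, the two printed forms can be told apart in code.
A small sketch in Python 3, assuming the prompt echoes repr() of the
value while print writes str(); the names are illustrative.

```python
s = "\\d"                 # two characters: a backslash and a d

shown_by_print = str(s)   # what print writes
shown_by_repl = repr(s)   # what the interactive prompt echoes

print(shown_by_print)     # \d
print(shown_by_repl)      # '\\d'

# The triple-quoted spelling produces the same value, so it is echoed
# the same way.
assert """\\d""" == s

# The echo picks its own quotes: it switches to double quotes only when
# the value contains a single quote and no double quote.
assert repr("it's") == '"it\'s"'
```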


> If we are to exploit this consistency, then I see changing my head
> into typing double backslashes for special regexps constructs a
> "price worth paying" (given a previous background). For fresh minds,
> this sounds like a very good idea.

If you get the impression that I dislike what python does with
quoting, then that would be a correct one...  IMO, quoting is very
difficult, and the art of designing a sane syntax for quoting is
something that is *extremely* difficult.  To make things worse, it's
something for which languages frequently have their own solutions, and
some of them are horrible.  (See some code that does quoting in unix shells
-- contrary to first impression, things like '"'"' are not some
obfuscated jokes.)  Python went with a bunch of syntaxes, but then
tried to make things uniform in some obscure but mostly useless way.
(E.g., the cosmetic difference between "..." and """...""".  If anyone
knows of a *good* reason for that thing, I'll be happy to hear it.)

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


Posted on the users mailing list.