[racket] Some design "whys" of regexps in Racket

From: Eli Barzilay (eli at barzilay.org)
Date: Fri Jun 3 22:57:46 EDT 2011

15 minutes ago, Rodolfo Carvalho wrote:
> Hello,
> 
> I'm curious about 2 design decisions made:

[Different answers from Jay's:]

> 1) Why do I have to escape things like "\d{2}" -> "\\d{2}"?

A backslash is very commonly used as an escape character in regexps,
and it's also commonly used as an escape character in strings.  So
when you want to specify a string that has backslash-d, you need to
write "\\d" -- and the syntax of a regexp is essentially a string with
an `#rx' or `#px' prefix.
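
Here is a minimal sketch of the two layers of escaping described
above.  The names `digit-source', `two-digits-literal' and
`two-digits-built' are made up for the illustration, and the sample
input string is arbitrary.

```racket
#lang racket/base

;; The string reader consumes one level of backslashes, so the source
;; text "\\d" denotes a two-character string: a backslash and a d.
(define digit-source "\\d")
(string-length digit-source)                  ; 2

;; A #px literal goes through the same string escaping, so these two
;; definitions are built from the same pattern text.
(define two-digits-literal #px"\\d{2}")
(define two-digits-built (pregexp "\\d{2}"))

(regexp-match two-digits-literal "year 2011") ; '("20")
(regexp-match two-digits-built "year 2011")   ; '("20")
```

Writing "\d" with a single backslash would be rejected by the reader,
since backslash-d is not a known string escape.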

There could be another syntax -- it would probably be easy to
implement something like #rx/blah/ where the "blah" part is written
literally, without string escapes.  But then you'd need to come up
with some way to represent the characters that escapes are used for.
For example, you'd need to make up some way to write newlines since you
won't have "\n", and a way to write arbitrary characters instead of
things like "\0".  This will be difficult -- consider something like
"\1", which is the way to write a string with the character whose
code is 1 -- the same syntax in regexps commonly refers to the 1st
matched subexpression.  If your syntax follows regexp
conventions, you get things like #rx/(.)\1/ but not the former use, so
you'll need to find an alternative and most likely end up inventing
your own syntax.  It's a possible solution, but when programmers are
dealing with complex regexps, they really don't need one more new
thing to learn.
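
The clash over "\1" can be seen in the current syntax with a small
sketch.  The names `ctrl-string' and `doubled-char' are made up for
the illustration, and I use #px here since, as far as I know,
backreferences belong to the pregexp syntax.

```racket
#lang racket/base

;; In a plain string, "\1" is an octal escape: a single character
;; whose code is 1.
(define ctrl-string "\1")
(string-length ctrl-string)                 ; 1
(char->integer (string-ref ctrl-string 0))  ; 1

;; In a pattern, backslash-1 means "whatever group 1 matched", so the
;; backslash has to be doubled to survive string reading.
(define doubled-char #px"(.)\\1")
(regexp-match doubled-char "hello")         ; '("ll" "l")
```

A literal syntax without string escapes would have to pick one of
these two meanings for \1 and invent a new spelling for the other.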

(BTW, Racket's solution is something that is done in many other
languages too.)


> 2) Why there are two kinds of regexps, #rx and #px?
> #rx"\\d{2}" doesn't work because the curly braces are part of just
> pregexps grammar...

As you note, there are differences in the regexp syntax.  When the
regexp system was extended to do the same kind of things that pregexps
do, there was a question of how to do the extension.  One option was
to have the extended syntax replace the existing one, which would mean
that if you happened to write something like #rx"a{1}" and expected it
to match a,open-brace,1,close-brace, then your code would break.  So
#px was added as the extended syntax instead.  Otherwise there is no
difference, since they both get "compiled" to the same thing.  (So
it's perfectly fine to always use #px.)
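
A small sketch of the difference, using the same "a{1}" pattern in
both syntaxes (the names `rx-braces' and `px-braces' are made up for
the illustration):

```racket
#lang racket/base

(define rx-braces #rx"a{1}")   ; braces are ordinary characters
(define px-braces #px"a{1}")   ; braces are a repetition count

(regexp-match rx-braces "a{1}")   ; '("a{1}")
(regexp-match rx-braces "a")      ; #f
(regexp-match px-braces "a{1}")   ; '("a")

;; Both kinds are regexp values; only the surface syntax differs.
(regexp? px-braces)    ; #t
(pregexp? rx-braces)   ; #f
```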

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!


Posted on the users mailing list.