[plt-scheme] Regexp partially matches alternate (unexpected)

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Sat May 15 10:42:31 EDT 2010

At Thu, 13 May 2010 23:29:19 -0400, Eli Barzilay wrote:
> > > (regexp-match #rx"(4)56|(.*)" "4ab")
> > ("4ab" "4" "4ab")
> > 
> > I would have expected to get:
> > ("4ab" #f "4ab")
> This looks like a bug -- I tried `git bisect', and it was introduced
> in subversion r4379, which is now d3b96f936.

Fixed for the next version.

> > Is there any way to achieve the latter? What I'm really matching
> > against is something more like:
> > 
> > > (regexp-match #px"(?:(?:(\\d)(\\d)(\\d))|(.*))" "4ab")
> > ("4ab" "4" #f #f "4ab")
> > 
> > and, in the event that there are not 3 digits, I would expect #f
> > instead of the "4"?
> (Yeah, it looks like the same problem.)


The problem was in parenthesized sub-patterns that have a fixed width
at the byte-string level, such as "(4)" above, which always matches
exactly one byte. For example,

  (regexp-match #rx"(.)56|(.*)" "4ab")

didn't have the problem, because "." at the byte-string level matches
the UTF-8 encoding of a character, which has a variable width. So the
general handling of sub-patterns was fine, but an optimized case (for
sub-patterns that match a fixed number of bytes) was broken.
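
As a rough way to check an installation, the patterns from this thread can
be collected into a small module. This is only a sketch: the `groups`
helper and the variable names are hypothetical (they are not from the
thread), and the results noted in the comments are the ones expected on a
version that includes the fix.

```racket
#lang racket/base

;; Hypothetical helper (not from the thread): return only the captured
;; groups of a match, dropping the whole-match element.
(define (groups pattern input)
  (define m (regexp-match pattern input))
  (and m (cdr m)))

;; "(4)" always matches exactly one byte, so it is a fixed-width
;; sub-pattern. With the fix, the failed first alternative leaves it #f.
(define fixed-width (groups #rx"(4)56|(.*)" "4ab"))      ; '(#f "4ab")

;; "(.)" matches the UTF-8 encoding of one character, which is variable
;; width at the byte level, so it never used the optimized case.
(define variable-width (groups #rx"(.)56|(.*)" "4ab"))   ; '(#f "4ab")

;; The three-digit pattern from the original question.
(define three-digits
  (groups #px"(?:(?:(\\d)(\\d)(\\d))|(.*))" "4ab"))      ; '(#f #f #f "4ab")

;; Why "." is variable width: characters encode to different byte counts.
(define ascii-width (bytes-length (string->bytes/utf-8 "4")))   ; 1
(define lambda-width (bytes-length (string->bytes/utf-8 "λ")))  ; 2
```

On a version without the fix, `fixed-width` and `three-digits` would be
expected to show the stray "4" reported at the top of this thread, while
`variable-width` should be unaffected.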

Sub-patterns in lookahead and lookbehind had similar issues, which are
also now fixed.

Posted on the users mailing list.