[plt-scheme] Regexp partially matches alternate (unexpected)
At Thu, 13 May 2010 23:29:19 -0400, Eli Barzilay wrote:
> > > (regexp-match #rx"(4)56|(.*)" "4ab")
> > ("4ab" "4" "4ab")
> >
> > I would have expected to get:
> > ("4ab" #f "4ab")
>
> This looks like a bug -- I tried `git bisect', and it was introduced
> in subversion r4379, which is now d3b96f936.
Fixed for the next version.
> > Is there any way to achieve the latter? What I'm really matching
> > against is something more like:
> >
> > > (regexp-match #px"(?:(?:(\\d)(\\d)(\\d))|(.*))" "4ab")
> > ("4ab" "4" #f #f "4ab")
> >
> > and, in the event that there are not 3 digits, I would expect #f
> > instead of the "4"?
>
> (Yeah, it looks like the same problem.)
Yes.
The problem was in parenthesized sub-patterns that have a fixed width
at the byte-string level. For example,
(regexp-match #rx"(.)56|(.*)" "4ab")
didn't have the problem, because "." at the byte-string level matches a
UTF-8 encoding of a character, which has a variable width. So, the
general form of sub-pattern handling was ok, but an optimized case
(for patterns matching a certain fixed length) was broken.
Sub-patterns in lookahead and lookbehind had similar issues, which are
also now fixed.
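
As a rough way to check a given build, the three patterns from this thread can be run side by side. This is only a sketch: it assumes a version that includes the fix, the `#lang racket/base` line is an assumption (an older PLT Scheme install would use `#lang scheme/base`), and the `define`d names are purely illustrative.

```racket
#lang racket/base
;; Sketch only: assumes a build that includes the fix described above.
;; The names below are illustrative and do not come from the thread.

;; "(4)" has a fixed width at the byte-string level, so this pattern
;; went through the optimized case that was broken.
(define fixed-width
  (regexp-match #rx"(4)56|(.*)" "4ab"))

;; "(.)" has a variable width at the byte-string level (it matches a
;; UTF-8 encoded character), so it used the general case and was fine.
(define variable-width
  (regexp-match #rx"(.)56|(.*)" "4ab"))

;; The three-digit pattern from the original question; each "(\\d)"
;; group is fixed-width, so all three should be #f when the first
;; alternative fails.
(define three-digits
  (regexp-match #px"(?:(?:(\\d)(\\d)(\\d))|(.*))" "4ab"))

fixed-width     ; expected: ("4ab" #f "4ab")
variable-width  ; expected: ("4ab" #f "4ab")
three-digits    ; expected: ("4ab" #f #f #f "4ab")
```

On a build without the fix, the first and third results would be expected to show "4" in the first group instead of #f, as reported at the top of the thread.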