[plt-scheme] Possible bug in pregexp-split in 370.

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Wed Jun 13 18:08:12 EDT 2007

At Wed, 13 Jun 2007 10:12:03 -0700, "Jon Philpott" wrote:
> Hi All,
> 
> According to the pregexp documentation:
> 
> (pregexp-split "" "smithereens")
> => ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")
> 
> According to mzscheme 370:
> 
> $ /usr/local/plt/bin/mzscheme
> Welcome to MzScheme v370 [3m], Copyright (c) 2004-2007 PLT Scheme Inc.
> > (require (lib "pregexp.ss"))
> >  (pregexp-split "" "smithereens")
> regexp-split: pattern matched a zero-length substring
> 
> ... and mzscheme 352 (from ubuntu):
> 
> Welcome to MzScheme version 352, Copyright (c) 2004-2006 PLT Scheme Inc.
> > (require (lib "pregexp.ss"))
> > (pregexp-split "" "smithereens")
> ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")
> 
> Did the behaviour of the function change while the documentation didn't?

That's right. The behavior changed in v360 when the `regexp' and
`pregexp' worlds were unified. The docs didn't get updated correctly.

I now think that the implementation should change back to the old
behavior, and the docs should be fixed to accurately describe that
behavior.


The docs say

 If the first argument can match an empty string, then the list of all
 the single-character substrings is returned.

but in 352:

 > (pregexp-split "a*" "banana split")
 ("b" "n" "n" " " "s" "p" "l" "i" "t")

That is, "a*" can match the empty string, but the result is not all of
the single-character substrings.
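
One way to read the 352 result is as a left-to-right scan in which an
empty match consumes the next character as a field, and a non-empty
match immediately after such a character is dropped as a delimiter.
The sketch below is only a guess at that intent, not the library
source. The name `split-advancing-on-empty' is hypothetical, and the
code assumes the unified (v360 and later) `regexp-match-positions'
and `pregexp' built-ins.

```scheme
;; Hypothetical helper: a sketch of the apparent v352 behavior,
;; not the actual pregexp.ss implementation.
(define (split-advancing-on-empty pat str)
  (let ([rx (if (string? pat) (pregexp pat) pat)]
        [n (string-length str)])
    (let loop ([i 0] [acc '()] [just-took-char? #f])
      (cond
        [(>= i n) (reverse acc)]
        [(regexp-match-positions rx str i)
         => (lambda (m)
              (let ([j (caar m)] [k (cdar m)])
                (cond
                  ;; Empty match: emit the text from i through the
                  ;; character at j, then step past that character.
                  [(= j k)
                   (loop (+ k 1)
                         (cons (substring str i (min n (+ j 1))) acc)
                         #t)]
                  ;; Non-empty match directly after a character taken
                  ;; by an empty match: skip it without emitting "".
                  [(and (= j i) just-took-char?)
                   (loop k acc #f)]
                  ;; Ordinary delimiter: emit the text before it.
                  [else
                   (loop k (cons (substring str i j) acc) #f)])))]
        ;; No further match: the rest of the string is the last field.
        [else (reverse (cons (substring str i n) acc))]))))
```

Under this reading, the pattern "" yields every single character of
"smithereens", while "a*" on "banana split" drops the runs of "a" and
keeps the other characters, which agrees with the two 352 transcripts
above.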

At the time of merging, I actually thought that `pregexp-split' needed
a predicate "can match an empty substring?", and that predicate isn't
so easy to define (given that regexps are not actually restricted to
regular expressions). That chain of reasoning led me to prefer the
`regexp-split' behavior.
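
As one possible illustration of why such a predicate is awkward
(an example chosen here, not one from the original discussion):
matching the pattern against "" is not a sufficient test, because a
lookahead pattern can fail on the empty string and still produce a
zero-length match inside a longer one. The sketch assumes the v360
and later `pregexp' syntax with `(?=...)' lookahead.

```scheme
;; Illustrative only: a pattern that never consumes any characters.
(define rx (pregexp "(?=a)"))

;; There is no "a" to look ahead to, so this does not match.
(regexp-match-positions rx "")

;; Here the pattern matches a zero-length substring just before
;; the first "a".
(regexp-match-positions rx "banana")
```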

But, with examples like the one above, I now see that the intent is a
bit different than what the docs say. I'm still not sure exactly how to
describe that intent precisely and succinctly, but I'll sort it out.


Matthew



Posted on the users mailing list.