[racket-dev] no capturing groups in regexp-split? [was Re: [PATCH] add regexp-split]

From: Marijn (hkBst at gentoo.org)
Date: Fri Dec 30 04:58:44 EST 2011

On 30-12-11 09:32, Eli Barzilay wrote:
> This doesn't look like an issue that is related to guile, just that
> he chose python as the goal...  The first other random example I
> tried was `split-string' in Emacs, which did the same thing as
> Racket.

They may well choose python's version as the goal. It doesn't look like
they have looked very hard yet at what else is out there, probably
because they expect most implementations to be compatible.

> 
>> Welcome to Racket v5.2.0.7.
>>> (regexp-split "([^0-9])"  "123+456*/")
>> '("123" "456" "" "")
>> 
>> should it be considered a bug in racket that it doesn't support 
>> capturing groups in regexp-split?
> 
> No.
> 
> 
>> Without the capturing group the results are identical: [...]
> 
> Which is expected.

Good, just establishing a baseline here, but it is good that some
compatibility is *expected*. How nice is that? Since we're expecting
compatibility between python and racket, I guess it goes without
saying that racket's and guile's regexp-split should be compatible as
well. R7RS Large may standardize a regular expression library, and we
can make that easier by reducing incompatibilities between schemes. We
can all grow from examining our incompatibilities, discussing them and
sometimes resolving them.

> Python does something which is IMO very weird:
> 
>>>> re.split("([^0-9])", "123+456*/")
> ['123', '+', '456', '*', '', '/', '']
> 
> It's even more confusing with multiple patterns:
> 
>>>> re.split("([^0-9]([0-9]))", "123+456*/")
> ['123', '+4', '4', '56*/']
> 
> There are probably uses for that -- at least for the simple version
> with a single group around the whole regexp, but that's some hybrid
> of `regexp-split' and `regexp-match*': it returns something that
> interleaves them, which can be useful, but I'd rather see it with
> a different name.

Yes, I find it a bit weird as well.

You don't lose anything by supporting this though, since you can
always use a non-capturing group, but I do agree that it can be
considered an inappropriate extension of the meaning of regexp-split.
I'll be sure to raise these issues on the guile list.
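
To make the non-capturing-group point concrete, here is a small sketch in
Python (chosen only because that is where the capturing behaviour under
discussion comes from; the variable names are purely illustrative). It
checks that a `(?:...)` group gives the same pieces as no group at all,
and that the capturing form is just the plain split interleaved with the
separators:

```python
import re

text = "123+456*/"

# No group, and a non-capturing group: separators are dropped.
plain = re.split(r"[^0-9]", text)
non_capturing = re.split(r"(?:[^0-9])", text)

# Capturing group: Python keeps each captured separator in the result.
capturing = re.split(r"([^0-9])", text)

# Rebuild the capturing result by interleaving the plain pieces
# with the separators found by a separate findall pass.
separators = re.findall(r"[^0-9]", text)
interleaved = [plain[0]]
for sep, piece in zip(separators, plain[1:]):
    interleaved += [sep, piece]

print(plain)          # ['123', '456', '', '']
print(non_capturing)  # ['123', '456', '', '']
print(capturing)      # ['123', '+', '456', '*', '', '/', '']
print(interleaved == capturing)  # True
```

So anyone who wants the current Racket result under Python-style semantics
only has to write the group as non-capturing.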

> We've talked semi-recently about adding an option to
> `regexp-match*' so it can return the lists of matches for each
> pattern, perhaps add another option for returning the unmatched
> sequences between them, and give the whole thing a new name?
> (Something that indicates it being the multitool version of all of
> these.)

Interesting.
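
A rough sketch of what such a combined function might return, written in
Python for easy comparison with the re.split examples quoted above. The
name `split_and_match` and its return shape are made up for illustration;
this is not an existing Racket or Python function, and it assumes the
pattern never matches the empty string:

```python
import re

def split_and_match(pattern, text):
    """Hypothetical helper: return (gaps, matches).

    gaps    -- the unmatched strings between (and around) the matches
    matches -- one tuple per match: (whole match, group 1, group 2, ...)
    """
    gaps, matches = [], []
    pos = 0
    for m in re.finditer(pattern, text):
        gaps.append(text[pos:m.start()])          # text before this match
        matches.append((m.group(0),) + m.groups())
        pos = m.end()
    gaps.append(text[pos:])                       # trailing unmatched text
    return gaps, matches

gaps, matches = split_and_match(r"([^0-9]([0-9]))", "123+456*/")
print(gaps)     # ['123', '56*/']
print(matches)  # [('+4', '+4', '4')]
```

Keeping the two lists separate avoids the interleaving that makes the
multi-group re.split output hard to read, while still returning all of
the same information.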

Marijn


Posted on the dev mailing list.