[racket-dev] `racket/string' extensions

From: Eli Barzilay (eli at barzilay.org)
Date: Thu Apr 19 12:28:39 EDT 2012

Sorry for the new thread, but this is a kind of summary of the
extensions that I think we're converging on, with a way to resolve the
exact meaning of arguments.  Please read through and reply if you see
any problems with it.  There are three specific questions, which are
marked with [*1*]...[*3b*] -- I'd appreciate suggestions for them.


Starting with the problem of the argument, I think that the best
choice is to go with plain ones -- no implicit `+', and no strings as
bags of characters.  (This is option (c) in the other thread.)

Two rationales:

  * These functions are supposed to be simple, so a simple rule like
    that works nicely to that end.

  * It uses strings for what they are: an ordered sequence of
    characters.  Going with a bag-of-characters is really abusing the
    string type.  Adding an implicit `+' complicates things since it
    interprets a string as a kind of a pattern.

But to allow other uses, make these arguments a string *or* a regexp,
where a regexp is taken as-is.  This leads to another simplicity point
in this design:

  * These functions are mostly similar to the regexp ones, except that
    the implicit coercion from a string to a pattern happens with
    `regexp-quote' rather than with `regexp'.
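
To make the coercion rule concrete, here is a minimal sketch.  The
helper name `->pattern' is hypothetical and not part of the proposal;
it only shows the difference between going through `regexp-quote' and
going through plain `regexp'.

```racket
#lang racket
;; Hypothetical helper (assumption, not a proposed binding):
;; a string is matched literally, a regexp is used as-is.
(define (->pattern x)
  (if (string? x) (regexp (regexp-quote x)) x))

(regexp-match? (regexp ".") "x")        ; #t: "." acts as a wildcard
(regexp-match? (->pattern ".") "x")     ; #f: "." is a literal dot
(regexp-match? (->pattern #rx".") "x")  ; #t: a regexp passes through
```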

It also means that when you want something that is not a plain string,
you just use a regexp.  This doesn't necessarily go back to the full
regexp versions with the implied complexity.  A few examples:

  * The default argument for a pattern that serves as a separator (as
    in `string-trim' and `string-split') is a regexp: #px"\\s+".  So
    newbies get to use them without learning them.

  * If there's a need for something different in the future, say one
    or more spaces and tabs, then making regexps a valid input means
    that we could add a binding for such regexps, so newbies can now
    do something like:

      (string-split string spaces-or-tabs)

    and still not worry about regexps.

  * Even if there's some obvious need for the bag-of-chars thing, it
    could be added as a function:

      (string-split string (either " " "\t"))

Note that I'm not suggesting adding these last two items -- I'm just
saying that accepting regexps means that such extensions are easier to
do in the future.
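
As a rough sketch of how cheap such extensions would be, the two
hypothetical names from the examples above could be defined as below.
The definitions are assumptions made for illustration, and the
existing `regexp-split' stands in for the proposed `string-split'.

```racket
#lang racket
;; Hypothetical binding: one or more spaces and tabs.
(define spaces-or-tabs #px"[ \t]+")

;; Hypothetical function: matches any one of the given strings,
;; each taken literally.
(define (either . strs)
  (regexp (string-join (map regexp-quote strs) "|")))

(regexp-split spaces-or-tabs "a \t b")     ; '("a" "b")
(regexp-split (either " " "\t") "a b\tc")  ; '("a" "b" "c")
```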


The suggested functions are (these are skeletons, they'll also have
keyword arguments for some more tweaks):

  (string-trim str [sep #px"\\s+"])
    Removes occurrences of `sep' from the beginning and end of `str'.
    (Keywords can make it do only one side.)  This is already
    implemented (but not pushed).  I will need to change it though, in
    subtle ways due to the new meaning of the `sep' argument.
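
A minimal sketch of the trimming rule, not the unpushed implementation
mentioned above.  The name `trim-sketch' is made up, and the sketch
assumes that a plain string `sep' removes at most one occurrence per
side (no implicit `+'), while a regexp decides for itself how much it
matches.

```racket
#lang racket
;; Sketch only: remove one match of `sep' at each end of `str'.
(define (trim-sketch str [sep #px"\\s+"])
  (define rx (if (string? sep) (regexp (regexp-quote sep)) sep))
  (define len (string-length str))
  (define ms (regexp-match-positions* rx str))
  ;; Skip the first match only if it starts at the very beginning.
  (define left
    (if (and (pair? ms) (= 0 (car (first ms)))) (cdr (first ms)) 0))
  ;; Cut at the last match only if it reaches the very end.
  (define right
    (if (and (pair? ms) (= len (cdr (last ms)))) (car (last ms)) len))
  (substring str left (max left right)))

(trim-sketch "  foo bar  ")   ; "foo bar"
(trim-sketch ",,foo," ",")    ; ",foo"
```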

  (string-normalize-spaces str [sep #px"\\s+"])
    Replaces occurrences of `sep' with a space, trimming it at the
    edges.  (Keywords can disable the trimming, and can make it use a
    different character to substitute.)  This is also already
    implemented but will need to change in the same subtle ways as
    the previous one.
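
The behavior described here could be approximated as follows.  This
is a sketch under assumptions: `normalize-sketch' is a made-up name,
there are no keywords, and trimming is done by dropping the empty
pieces at the two edges.

```racket
#lang racket
;; Sketch only: split on `sep', drop empty edge pieces, join with " ".
(define (normalize-sketch str [sep #px"\\s+"])
  (define rx (if (string? sep) (regexp (regexp-quote sep)) sep))
  (define (empty-string? s) (string=? s ""))
  (define parts (regexp-split rx str))
  (string-join (dropf-right (dropf parts empty-string?) empty-string?)
               " "))

(normalize-sketch "  foo \t bar\n")   ; "foo bar"
```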

  (string-split str [sep #px"\\s+"])
    Splits `str' on occurrences of `sep'.  Unclear whether it should
    do that with or without trimming, which affects keeping a
    first/last empty part.  [*1*] Possible solution: make it take a
    `#:trim?' keyword, in analogy to `string-normalize-spaces'.  This
    would make `#t' the obvious choice for a default, which means that
      (string-split ",,foo, bar," ",") -> '("foo" " bar")
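
To illustrate what the `#:trim?' default changes, here is a small
sketch.  The name `split-sketch' is hypothetical, and it models
trimming as dropping the empty pieces at both ends, which is only one
possible reading of the proposal.

```racket
#lang racket
;; Sketch only: split on `sep'; with trimming, drop empty edge pieces.
(define (split-sketch str [sep #px"\\s+"] #:trim? [trim? #t])
  (define rx (if (string? sep) (regexp (regexp-quote sep)) sep))
  (define (empty-string? s) (string=? s ""))
  (define parts (regexp-split rx str))
  (if trim?
      (dropf-right (dropf parts empty-string?) empty-string?)
      parts))

(split-sketch ",,foo, bar," ",")             ; '("foo" " bar")
(split-sketch ",,foo, bar," "," #:trim? #f)  ; '("" "" "foo" " bar" "")
```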

  (string-replace str from to [start 0] [end (string-length str)])
    Simple wrapper that quotes the `from' and `to'.  Note the
    different argument order, which is supposed to make this more
    like common functions.  Another rationale for this difference:
    these functions focus on the string, rather than on the regexp.
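
A minimal sketch of such a wrapper, assuming it sits on top of
`regexp-replace*'.  The name `replace-sketch' is made up, and the
`start'/`end' arguments are left out to keep it short.

```racket
#lang racket
;; Sketch only: both `from' and `to' are taken literally.
(define (replace-sketch str from to)
  (regexp-replace* (regexp-quote from) str (regexp-replace-quote to)))

(replace-sketch "a.b.c" "." "&")    ; "a&b&c"
;; Without quoting, "." matches any character and "&" stands for
;; the matched text, so nothing changes:
(regexp-replace* "." "a.b.c" "&")   ; "a.b.c"
```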

  (string-index str sub [start 0] [end (string-length str)])
    Looks for occurrences of `sub' in `str', returns the index if
    found, #f otherwise.  [*2*] I'm not sure about the name, maybe
    `string-index-of' is better?
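
One way this could be built on the existing regexp functions is
sketched below.  `index-sketch' is a hypothetical name, and the sketch
relies on `regexp-match-positions' reporting positions relative to the
whole string even when a start offset is given.

```racket
#lang racket
;; Sketch only: index of the first literal occurrence of `sub', or #f.
(define (index-sketch str sub [start 0] [end (string-length str)])
  (define m (regexp-match-positions (regexp-quote sub) str start end))
  (and m (car (first m))))

(index-sketch "hello world" "o")     ; 4
(index-sketch "hello world" "o" 5)   ; 7
(index-sketch "hello" "z")           ; #f
```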

  (list-index list elt)
    Looks for `elt' in `list'.  This is a possible extension for
    `racket/list' that would be kind of obvious with adding the above.
    [*3*] I'm not sure if it should be added, but IIRC it was
    requested a few times.  If it does get added, then there's another
    question for how far the analogy goes: [*3a*] Should it take a
    start/end index too?  [*3b*] Should it take a list of elements and
    look for a matching sublist instead (which is not a commonly
    requested function, AFAICT)?
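
For reference, the simplest version of this (no start/end, no sublist
matching) could look like the sketch below.  The name
`list-index-sketch' and the use of `equal?' for comparison are
assumptions.

```racket
#lang racket
;; Sketch only: index of the first element `equal?' to `elt', or #f.
(define (list-index-sketch lst elt)
  (let loop ([l lst] [i 0])
    (cond [(null? l) #f]
          [(equal? (car l) elt) i]
          [else (loop (cdr l) (add1 i))])))

(list-index-sketch '(a b c) 'c)   ; 2
(list-index-sketch '(a b c) 'd)   ; #f
```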

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!

Posted on the dev mailing list.