[racket] string-trim : an implementation & a question

From: Jon Zeppieri (zeppieri at gmail.com)
Date: Sat Apr 2 18:23:10 EDT 2011

On Sat, Apr 2, 2011 at 6:06 PM, Robby Findler
<robby at eecs.northwestern.edu>wrote:

> (oh, and I meant to add the usual "where are you test cases?!?!"
> comment here, but forgot.)
>
> Robby
>


Ugh, you're right.

My understanding is that the function is supposed to return a string equal
to the input string with leading and trailing whitespace removed. But I
wasn't the one who originally started the discussion; that was Richard
Hixson. I just got curious, because he wanted to avoid using regexps for
performance reasons (I think), and that made we wonder what the how large
the difference was.

But back to the function... Yes, that's broken. (Also, it turns out that
replacing #px with #rx may make the former function a lot faster, but it
doesn't actually work, at all.)

In the second function, the end parameter to the second use of 'scan' is
wrong, of course, since first non-whitespace character in the string may
also be the last.  So, changing the second string-trim function to:

(define (string-trim s)
  (define-syntax scan
    (syntax-rules ()
      ((_ s start end step)
       (for/first ((i (in-range start end step))
                   #:when (not (char-whitespace? (string-ref s i))))
         i))))

  (let* ((len (string-length s))
         (last-index (sub1 len))
         (start (or (scan s 0 len 1) 0))
         (end (or (scan s last-index (sub1 start) -1) last-index)))
    (substring s start (add1 end))))

... works on the the following test cases:

> (string-trim "")
""
> (string-trim "a")
"a"
> (string-trim "ab")
"ab"
> (string-trim " ab")
"ab"
> (string-trim "   ab")
"ab"
> (string-trim "   ab   ")
"ab"
> (string-trim "ab   ")
"ab"
> (string-trim " s sdf d  ")
"s sdf d"

... and the times aren't much altered.

-Jon



> On Sat, Apr 2, 2011 at 5:06 PM, Robby Findler
> <robby at eecs.northwestern.edu> wrote:
> > I've lost track of what the function is supposed to be doing, but your
> > two functions don't agree on the input "a ", I don't think. I get
> > this:
> >
> > (define (string-trim.1 s)
> >  (regexp-replace #px"^\\s*([^\\s]*)\\s*$" s "\\1"))
> >
> > (define (string-trim.2 s)
> >  (define-syntax scan
> >    (syntax-rules ()
> >      ((_ s start end step)
> >       (for/first ((i (in-range start end step))
> >                   #:when (not (char-whitespace? (string-ref s i))))
> >         i))))
> >
> >  (let* ((len (string-length s))
> >         (last-index (sub1 len))
> >         (start (or (scan s 0 len 1) 0))
> >         (end (or (scan s last-index start -1) last-index)))
> >    (substring s start (add1 end))))
> >
> >> (string-trim.2 "a ")
> > "a "
> >> (string-trim.1 "a ")
> > "a"
> >
> >
> > On Sat, Apr 2, 2011 at 5:03 PM, Jon Zeppieri <zeppieri at gmail.com> wrote:
> >> Actually #rx seems to be much faster than #px (in this case, at any
> rate),
> >> but it's still slower:
> >>> (test)
> >> cpu time: 1162 real time: 1181 gc time: 40
> >> cpu time: 230 real time: 230 gc time: 0
> >>> (test)
> >> cpu time: 1184 real time: 1198 gc time: 38
> >> cpu time: 258 real time: 259 gc time: 21
> >>> (test)
> >> cpu time: 1220 real time: 1544 gc time: 40
> >> cpu time: 233 real time: 233 gc time: 0
> >>
> >> On Sat, Apr 2, 2011 at 5:56 PM, Jon Zeppieri <zeppieri at gmail.com>
> wrote:
> >>>
> >>> I was a bit surprised to find that the scanning-by-hand approach really
> is
> >>> significantly faster than using regexps.
> >>> Between these two functions:
> >>> (define (string-trim s)
> >>>   (regexp-replace #px"^\\s*([^\\s]*)\\s*$" s "\\1"))
> >>> ... and ...
> >>> (define (string-trim s)
> >>>   (define-syntax scan
> >>>     (syntax-rules ()
> >>>       ((_ s start end step)
> >>>        (for/first ((i (in-range start end step))
> >>>                    #:when (not (char-whitespace? (string-ref s i))))
> >>>          i))))
> >>>
> >>>   (let* ((len (string-length s))
> >>>          (last-index (sub1 len))
> >>>          (start (or (scan s 0 len 1) 0))
> >>>          (end (or (scan s last-index start -1) last-index)))
> >>>     (substring s start (add1 end))))
> >>>
> >>> ... the latter is much faster. On 100000 iterations, using the test
> >>> string:
> >>>  "                                                      \n  \t foo bar
> >>> baz\n                                    \r   "
> >>> as input, I'm getting numbers like these (where the first time is for
> the
> >>> regexp function and the second is for the hand-scanning function):
> >>> > (test)
> >>> cpu time: 8003 real time: 8008 gc time: 0
> >>> cpu time: 256 real time: 257 gc time: 22
> >>> > (test)
> >>> cpu time: 8028 real time: 8025 gc time: 0
> >>> cpu time: 255 real time: 255 gc time: 22
> >>> > (test)
> >>> cpu time: 8418 real time: 8424 gc time: 0
> >>> cpu time: 260 real time: 260 gc time: 22
> >>> > (test)
> >>> cpu time: 8390 real time: 8401 gc time: 0
> >>> cpu time: 252 real time: 253 gc time: 20
> >>>
> >>>
> >>>
> >>> On Sat, Apr 2, 2011 at 5:20 PM, Richard Cleis <rcleis at mac.com> wrote:
> >>>>
> >>>> You can use an index to the string to find the location of your goal,
> >>>> then return the substring when you are done.
> >>>>
> >>>> rac
> >>>>
> >>>> On Apr 2, 2011, at 3:08 PM, Charles Hixson wrote:
> >>>>
> >>>> > This seems to be what I want the string-trim to do, but it seems
> that
> >>>> > all the string copying would be expensive.  Is there a way to
> improve it by
> >>>> > avoiding the string copying?
> >>>> >
> >>>> > My original inclination was to use a while loop with a test for
> >>>> > non-whitespace, but that appears to not be something scheme
> supports.
> >>>> >
> >>>> > (define (string-trim s)
> >>>> >    (let ( (l (string-length s) ) )
> >>>> >      (cond
> >>>> >        [ (= l 0) #f]
> >>>> >        [ (char-whitespace? (string-ref s (- l 1) ) )    (string-trim
> >>>> > (substring s 0 (- l 1) ) ) ]
> >>>> >        [else s]) ) )
> >>>> > _________________________________________________
> >>>> > For list-related administrative tasks:
> >>>> > http://lists.racket-lang.org/listinfo/users
> >>>>
> >>>> _________________________________________________
> >>>>  For list-related administrative tasks:
> >>>>  http://lists.racket-lang.org/listinfo/users
> >>>
> >>
> >>
> >> _________________________________________________
> >>  For list-related administrative tasks:
> >>  http://lists.racket-lang.org/listinfo/users
> >>
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.racket-lang.org/users/archive/attachments/20110402/b8fb38bf/attachment.html>

Posted on the users mailing list.