[racket] string-trim : an implementation & a question

From: Jon Zeppieri (zeppieri at gmail.com)
Date: Sat Apr 2 18:38:19 EDT 2011

Ha. But I should have tested the first function, right? Yes, that was also
broken.

(define (string-trim s)
  (regexp-replace #px"^\\s*(.*?)\\s*$" s "\\1"))

... passes the test cases and is a lot faster than the broken version; it's
now a little less than a 2x difference:

> (test)
cpu time: 426 real time: 437 gc time: 22
cpu time: 231 real time: 230 gc time: 0
> (test)
cpu time: 422 real time: 431 gc time: 21
cpu time: 231 real time: 231 gc time: 0
> (test)
cpu time: 450 real time: 456 gc time: 21
cpu time: 237 real time: 261 gc time: 0





On Sat, Apr 2, 2011 at 6:23 PM, Jon Zeppieri <zeppieri at gmail.com> wrote:

>
>
> On Sat, Apr 2, 2011 at 6:06 PM, Robby Findler <robby at eecs.northwestern.edu
> > wrote:
>
>> (oh, and I meant to add the usual "where are you test cases?!?!"
>> comment here, but forgot.)
>>
>> Robby
>>
>
>
> Ugh, you're right.
>
> My understanding is that the function is supposed to return a string equal
> to the input string with leading and trailing whitespace removed. But I
> wasn't the one who originally started the discussion; that was Richard
> Hixson. I just got curious, because he wanted to avoid using regexps for
> performance reasons (I think), and that made we wonder what the how large
> the difference was.
>
> But back to the function... Yes, that's broken. (Also, it turns out that
> replacing #px with #rx may make the former function a lot faster, but it
> doesn't actually work, at all.)
>
> In the second function, the end parameter to the second use of 'scan' is
> wrong, of course, since first non-whitespace character in the string may
> also be the last.  So, changing the second string-trim function to:
>
> (define (string-trim s)
>   (define-syntax scan
>     (syntax-rules ()
>       ((_ s start end step)
>        (for/first ((i (in-range start end step))
>                    #:when (not (char-whitespace? (string-ref s i))))
>          i))))
>
>   (let* ((len (string-length s))
>          (last-index (sub1 len))
>          (start (or (scan s 0 len 1) 0))
>          (end (or (scan s last-index (sub1 start) -1) last-index)))
>     (substring s start (add1 end))))
>
> ... works on the the following test cases:
>
> > (string-trim "")
> ""
> > (string-trim "a")
> "a"
> > (string-trim "ab")
> "ab"
> > (string-trim " ab")
> "ab"
> > (string-trim "   ab")
> "ab"
> > (string-trim "   ab   ")
> "ab"
> > (string-trim "ab   ")
> "ab"
> > (string-trim " s sdf d  ")
> "s sdf d"
>
> ... and the times aren't much altered.
>
> -Jon
>
>
>
>> On Sat, Apr 2, 2011 at 5:06 PM, Robby Findler
>> <robby at eecs.northwestern.edu> wrote:
>> > I've lost track of what the function is supposed to be doing, but your
>> > two functions don't agree on the input "a ", I don't think. I get
>> > this:
>> >
>> > (define (string-trim.1 s)
>> >  (regexp-replace #px"^\\s*([^\\s]*)\\s*$" s "\\1"))
>> >
>> > (define (string-trim.2 s)
>> >  (define-syntax scan
>> >    (syntax-rules ()
>> >      ((_ s start end step)
>> >       (for/first ((i (in-range start end step))
>> >                   #:when (not (char-whitespace? (string-ref s i))))
>> >         i))))
>> >
>> >  (let* ((len (string-length s))
>> >         (last-index (sub1 len))
>> >         (start (or (scan s 0 len 1) 0))
>> >         (end (or (scan s last-index start -1) last-index)))
>> >    (substring s start (add1 end))))
>> >
>> >> (string-trim.2 "a ")
>> > "a "
>> >> (string-trim.1 "a ")
>> > "a"
>> >
>> >
>> > On Sat, Apr 2, 2011 at 5:03 PM, Jon Zeppieri <zeppieri at gmail.com>
>> wrote:
>> >> Actually #rx seems to be much faster than #px (in this case, at any
>> rate),
>> >> but it's still slower:
>> >>> (test)
>> >> cpu time: 1162 real time: 1181 gc time: 40
>> >> cpu time: 230 real time: 230 gc time: 0
>> >>> (test)
>> >> cpu time: 1184 real time: 1198 gc time: 38
>> >> cpu time: 258 real time: 259 gc time: 21
>> >>> (test)
>> >> cpu time: 1220 real time: 1544 gc time: 40
>> >> cpu time: 233 real time: 233 gc time: 0
>> >>
>> >> On Sat, Apr 2, 2011 at 5:56 PM, Jon Zeppieri <zeppieri at gmail.com>
>> wrote:
>> >>>
>> >>> I was a bit surprised to find that the scanning-by-hand approach
>> really is
>> >>> significantly faster than using regexps.
>> >>> Between these two functions:
>> >>> (define (string-trim s)
>> >>>   (regexp-replace #px"^\\s*([^\\s]*)\\s*$" s "\\1"))
>> >>> ... and ...
>> >>> (define (string-trim s)
>> >>>   (define-syntax scan
>> >>>     (syntax-rules ()
>> >>>       ((_ s start end step)
>> >>>        (for/first ((i (in-range start end step))
>> >>>                    #:when (not (char-whitespace? (string-ref s i))))
>> >>>          i))))
>> >>>
>> >>>   (let* ((len (string-length s))
>> >>>          (last-index (sub1 len))
>> >>>          (start (or (scan s 0 len 1) 0))
>> >>>          (end (or (scan s last-index start -1) last-index)))
>> >>>     (substring s start (add1 end))))
>> >>>
>> >>> ... the latter is much faster. On 100000 iterations, using the test
>> >>> string:
>> >>>  "                                                      \n  \t foo bar
>> >>> baz\n                                    \r   "
>> >>> as input, I'm getting numbers like these (where the first time is for
>> the
>> >>> regexp function and the second is for the hand-scanning function):
>> >>> > (test)
>> >>> cpu time: 8003 real time: 8008 gc time: 0
>> >>> cpu time: 256 real time: 257 gc time: 22
>> >>> > (test)
>> >>> cpu time: 8028 real time: 8025 gc time: 0
>> >>> cpu time: 255 real time: 255 gc time: 22
>> >>> > (test)
>> >>> cpu time: 8418 real time: 8424 gc time: 0
>> >>> cpu time: 260 real time: 260 gc time: 22
>> >>> > (test)
>> >>> cpu time: 8390 real time: 8401 gc time: 0
>> >>> cpu time: 252 real time: 253 gc time: 20
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Apr 2, 2011 at 5:20 PM, Richard Cleis <rcleis at mac.com> wrote:
>> >>>>
>> >>>> You can use an index to the string to find the location of your goal,
>> >>>> then return the substring when you are done.
>> >>>>
>> >>>> rac
>> >>>>
>> >>>> On Apr 2, 2011, at 3:08 PM, Charles Hixson wrote:
>> >>>>
>> >>>> > This seems to be what I want the string-trim to do, but it seems
>> that
>> >>>> > all the string copying would be expensive.  Is there a way to
>> improve it by
>> >>>> > avoiding the string copying?
>> >>>> >
>> >>>> > My original inclination was to use a while loop with a test for
>> >>>> > non-whitespace, but that appears to not be something scheme
>> supports.
>> >>>> >
>> >>>> > (define (string-trim s)
>> >>>> >    (let ( (l (string-length s) ) )
>> >>>> >      (cond
>> >>>> >        [ (= l 0) #f]
>> >>>> >        [ (char-whitespace? (string-ref s (- l 1) ) )
>>  (string-trim
>> >>>> > (substring s 0 (- l 1) ) ) ]
>> >>>> >        [else s]) ) )
>> >>>> > _________________________________________________
>> >>>> > For list-related administrative tasks:
>> >>>> > http://lists.racket-lang.org/listinfo/users
>> >>>>
>> >>>> _________________________________________________
>> >>>>  For list-related administrative tasks:
>> >>>>  http://lists.racket-lang.org/listinfo/users
>> >>>
>> >>
>> >>
>> >> _________________________________________________
>> >>  For list-related administrative tasks:
>> >>  http://lists.racket-lang.org/listinfo/users
>> >>
>> >
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.racket-lang.org/users/archive/attachments/20110402/678fd33b/attachment.html>

Posted on the users mailing list.