[racket] string-strip

From: Marijn (hkBst at gentoo.org)
Date: Wed Dec 28 12:00:49 EST 2011

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 27-12-11 18:59, Sam Phillips wrote:
> On Tue, Dec 27, 2011 at 09:55, Marijn <hkBst at gentoo.org> wrote:
>> Sometimes I wish for python-like string-strip to be a standard 
>> function. For example when I want to parse a list of numbers from
>> a text-field, I could get the string value of the text-field,
>> strip space from the beginning and end, split on whitespace, map 
>> string->number over it and check the result with `every'.
>> 
>> Is there a better way? If not then perhaps some form of
>> string-strip could be added (perhaps under regexp-strip name?)
>> since it isn't exactly extremely simple unless you're a regexp
>> wizard.
> 
> string-trim-both in srfi-13 does that.

Hi Sam,
now that you remind me of it I do recall that I figured this same
thing out a long time ago. Thanks for that!
Perhaps the docs could put a cross-reference under string-split for
srfi-13's string-trim-both.

> http://docs.racket-lang.org/srfi-std/srfi-13.html#string-trim
> 
> Cheers, Sam

On 27-12-11 20:56, Neil Van Dyke wrote:
> Tangential comment: These string operations are convenient to have 
> around, and they can be expedient, but composing with them can make
> for inefficient evaluation. For example, my first attempt to
> implement "string->number" (after briefly considering just using
> "read"), was 17 times faster in a shootout on my computer, compared
> to the one that was built using "string-strip" and other string
> operations.  (It also helped that I used Racket regexps rather than
> pregexps.)
> 
> (define (string->number-list/mk2 str) (let loop ((start 0)) (cond
> ((regexp-match-positions #rx"^[ \t\r\n\f]*([0-9]+(?:[0-9]+)?)" str 
> start) => (lambda (m) (let* ((range (cadr m)) (end   (cdr range))) 
> (cons (string->number (substring str (car range) (cdr range))) 
> (loop end))))) (else (if (regexp-match? #rx"^[ \t\r\n\f]*$" str
> start) '() (error 'string->number-list "could not parse string ~S
> at start position ~S" str start))))))
> 
> If I was writing performance-sensitive code, I would also see
> whether a procedure that scanned a character at a time, and didn't
> need "string->number", was faster.

Thank you, Neil, that was quite inspirational. Your use of
regexp-match-positions is instructive (I was never sure how to use
that function and didn't consider it for this use case). I did not
measure a noticeable difference between pregexp and regexp btw and I
changed your first regexp to either #rx"^[ \t\r\n\f]*([1-9][0-9]*)" or
#px"^[[:space:]]*([1-9][0-9]*)" which I think is equivalent.

> As everyone knows, if the code is not performance-sensitive, do
> whatever is easiest to develop in a correct and maintainable way.
> But, when performance matters -- such as in code that takes a
> significant percentage of time in serving a HTTP request, or is in
> a PLaneT package that other people might use in
> performance-sensitive context -- we also have to think about
> algorithmic efficiency, as well as costs specific to the language
> implementation.

I don't think my use of this code is very performance, but I couldn't
help myself, so I looked into making it faster according to your
suggestion. What I found was that it is much slower to treat a string
as a port and then read-char from that then it is to directly index
the string. Also peek-char seems to take almost as much time as
read-char, so you're basically reading everything twice since there is
no such thing as putting back a char into a port (doesn't C/C++ have
that?). Also while for/fold is extremely nice, it isn't as efficient
as a named-let in the cases where both work. I found that out when my
code was still collecting chars into a list and later walking that
list to construct a number. Of course its better to construct the
number straight away, so you don't have to cons that list at all.

In the end I was able to construct code that is another factor 5.5
faster than your version:

(define (string->number-list/mk5 str)
  (define length (string-length str))
  (define (match-unsigned-integer str pos)
    (define (match-digits result pos)
      (if (< pos length)
	  (let* ((char (string-ref str pos)) (pos (+ pos 1)))
	    (case char
	      ((#\0) (match-digits (* 10 result) pos))
	      ((#\1) (match-digits (+ (* 10 result) 1) pos))
	      ((#\2) (match-digits (+ (* 10 result) 2) pos))
	      ((#\3) (match-digits (+ (* 10 result) 3) pos))
	      ((#\4) (match-digits (+ (* 10 result) 4) pos))
	      ((#\5) (match-digits (+ (* 10 result) 5) pos))
	      ((#\6) (match-digits (+ (* 10 result) 6) pos))
	      ((#\7) (match-digits (+ (* 10 result) 7) pos))
	      ((#\8) (match-digits (+ (* 10 result) 8) pos))
	      ((#\9) (match-digits (+ (* 10 result) 9) pos))
	      (else (values result pos))))
	  (values result pos)))
    (define (match-first-digit pos)
      (if (< pos length)
	  (let* ((char (string-ref str pos)) (pos (+ pos 1)))
	    (case char
	      ((#\1) (match-digits 1 pos))
	      ((#\2) (match-digits 2 pos))
	      ((#\3) (match-digits 3 pos))
	      ((#\4) (match-digits 4 pos))
	      ((#\5) (match-digits 5 pos))
	      ((#\6) (match-digits 6 pos))
	      ((#\7) (match-digits 7 pos))
	      ((#\8) (match-digits 8 pos))
	      ((#\9) (match-digits 9 pos))
	      ((#\space #\tab #\return #\newline #\page)
	       (match-first-digit pos))
	      (else (values #f pos))))
	  (values #f pos)))
    (match-first-digit pos))
  (let loop ((pos 0) (result '()))
    (let-values (((unt pos) (match-unsigned-integer str pos)))
      (if unt
	  (loop pos (cons unt result))
	  (if (< pos length)
	      #f
	      (reverse result))))) )

Marijn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk77S0EACgkQp/VmCx0OL2wOMwCfbEXpEd+sEXw33tCSYMRDWHii
H8EAn3WKNBjxiqBuSKFfxMhg7w1US1s2
=I4Mo
-----END PGP SIGNATURE-----


Posted on the users mailing list.