[racket] string-strip
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 27-12-11 18:59, Sam Phillips wrote:
> On Tue, Dec 27, 2011 at 09:55, Marijn <hkBst at gentoo.org> wrote:
>> Sometimes I wish for python-like string-strip to be a standard
>> function. For example when I want to parse a list of numbers from
>> a text-field, I could get the string value of the text-field,
>> strip space from the beginning and end, split on whitespace, map
>> string->number over it and check the result with `every'.
>>
>> Is there a better way? If not then perhaps some form of
>> string-strip could be added (perhaps under regexp-strip name?)
>> since it isn't exactly extremely simple unless you're a regexp
>> wizard.
>
> string-trim-both in srfi-13 does that.
Hi Sam,
now that you remind me of it I do recall that I figured this same
thing out a long time ago. Thanks for that!
Perhaps the docs could put a cross-reference under string-split for
srfi-13's string-trim-both.
> http://docs.racket-lang.org/srfi-std/srfi-13.html#string-trim
>
> Cheers, Sam
On 27-12-11 20:56, Neil Van Dyke wrote:
> Tangential comment: These string operations are convenient to have
> around, and they can be expedient, but composing with them can make
> for inefficient evaluation. For example, my first attempt to
> implement "string->number" (after briefly considering just using
> "read"), was 17 times faster in a shootout on my computer, compared
> to the one that was built using "string-strip" and other string
> operations. (It also helped that I used Racket regexps rather than
> pregexps.)
>
> (define (string->number-list/mk2 str) (let loop ((start 0)) (cond
> ((regexp-match-positions #rx"^[ \t\r\n\f]*([0-9]+(?:[0-9]+)?)" str
> start) => (lambda (m) (let* ((range (cadr m)) (end (cdr range)))
> (cons (string->number (substring str (car range) (cdr range)))
> (loop end))))) (else (if (regexp-match? #rx"^[ \t\r\n\f]*$" str
> start) '() (error 'string->number-list "could not parse string ~S
> at start position ~S" str start))))))
>
> If I was writing performance-sensitive code, I would also see
> whether a procedure that scanned a character at a time, and didn't
> need "string->number", was faster.
Thank you, Neil, that was quite inspirational. Your use of
regexp-match-positions is instructive (I was never sure how to use
that function and didn't consider it for this use case). I did not
measure a noticeable difference between pregexp and regexp btw and I
changed your first regexp to either #rx"^[ \t\r\n\f]*([1-9][0-9]*)" or
#px"^[[:space:]]*([1-9][0-9]*)" which I think is equivalent.
> As everyone knows, if the code is not performance-sensitive, do
> whatever is easiest to develop in a correct and maintainable way.
> But, when performance matters -- such as in code that takes a
> significant percentage of time in serving a HTTP request, or is in
> a PLaneT package that other people might use in
> performance-sensitive context -- we also have to think about
> algorithmic efficiency, as well as costs specific to the language
> implementation.
I don't think my use of this code is very performance, but I couldn't
help myself, so I looked into making it faster according to your
suggestion. What I found was that it is much slower to treat a string
as a port and then read-char from that then it is to directly index
the string. Also peek-char seems to take almost as much time as
read-char, so you're basically reading everything twice since there is
no such thing as putting back a char into a port (doesn't C/C++ have
that?). Also while for/fold is extremely nice, it isn't as efficient
as a named-let in the cases where both work. I found that out when my
code was still collecting chars into a list and later walking that
list to construct a number. Of course its better to construct the
number straight away, so you don't have to cons that list at all.
In the end I was able to construct code that is another factor 5.5
faster than your version:
(define (string->number-list/mk5 str)
(define length (string-length str))
(define (match-unsigned-integer str pos)
(define (match-digits result pos)
(if (< pos length)
(let* ((char (string-ref str pos)) (pos (+ pos 1)))
(case char
((#\0) (match-digits (* 10 result) pos))
((#\1) (match-digits (+ (* 10 result) 1) pos))
((#\2) (match-digits (+ (* 10 result) 2) pos))
((#\3) (match-digits (+ (* 10 result) 3) pos))
((#\4) (match-digits (+ (* 10 result) 4) pos))
((#\5) (match-digits (+ (* 10 result) 5) pos))
((#\6) (match-digits (+ (* 10 result) 6) pos))
((#\7) (match-digits (+ (* 10 result) 7) pos))
((#\8) (match-digits (+ (* 10 result) 8) pos))
((#\9) (match-digits (+ (* 10 result) 9) pos))
(else (values result pos))))
(values result pos)))
(define (match-first-digit pos)
(if (< pos length)
(let* ((char (string-ref str pos)) (pos (+ pos 1)))
(case char
((#\1) (match-digits 1 pos))
((#\2) (match-digits 2 pos))
((#\3) (match-digits 3 pos))
((#\4) (match-digits 4 pos))
((#\5) (match-digits 5 pos))
((#\6) (match-digits 6 pos))
((#\7) (match-digits 7 pos))
((#\8) (match-digits 8 pos))
((#\9) (match-digits 9 pos))
((#\space #\tab #\return #\newline #\page)
(match-first-digit pos))
(else (values #f pos))))
(values #f pos)))
(match-first-digit pos))
(let loop ((pos 0) (result '()))
(let-values (((unt pos) (match-unsigned-integer str pos)))
(if unt
(loop pos (cons unt result))
(if (< pos length)
#f
(reverse result))))) )
Marijn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk77S0EACgkQp/VmCx0OL2wOMwCfbEXpEd+sEXw33tCSYMRDWHii
H8EAn3WKNBjxiqBuSKFfxMhg7w1US1s2
=I4Mo
-----END PGP SIGNATURE-----