[racket] help me speed up string split?

From: Robby Findler (robby at eecs.northwestern.edu)
Date: Wed Jun 18 04:26:24 EDT 2014

I think the ruby version is smarter than your Racket version.
Specifically, if you remove the .size (but don't print the output),
things slow down by a factor of about 2 for me.

$  time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f)'
> /dev/null
real 0m1.127s
user 0m1.071s
sys 0m0.054s
$  time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size'
655360

real 0m0.561s
user 0m0.512s
sys 0m0.047s

But meanwhile, it looks like the regexp matching is slower than in
Ruby. Below are some variations on the code that explore some speedups
for the computation you intended to happen, I believe. The middle
version (fetch-whitespace-delimited-string2) takes the same amount of
time as the version of ruby that is forced to actually compute the
list. So maybe there are still more improvements to be done to regexp
port matching, at least for certain regexps, I'm not sure.

Robby

#lang racket

(define (main fetch-whitespace-delimited-string)
 (length
  (call-with-input-file "X_train.txt"
    (λ (port)
      (let loop ()
        (define m (fetch-whitespace-delimited-string port))
        (cond
          [m
           (cons (string->number m) (loop))]
          [else
           '()]))))))

;; this one uses a regexp on the port
(define (fetch-whitespace-delimited-string1 p)
  (define m (regexp-match #rx"([^ \n]*)[ \n]" p))
  (cond
    [m (bytes->string/utf-8 (list-ref m 1))]
    [else #f]))

;; this one has a specialized loop that
;; expects exactly one whitespace char
(define (fetch-whitespace-delimited-string2 p)
  (let loop ([cs '()])
    (define c (read-char p))
    (cond
      [(or (eof-object? c) (memq c '(#\space #\newline)))
       (cond
         [(null? cs) #f]
         [else (apply string (reverse cs))])]
      [else (loop (cons c cs))])))

;; this one doesn't do any IO, just returns the same number
(begin
  (define c 655360) ;; put the number of numbers in the file here
  (define (fetch-whitespace-delimited-string3 p)
    (begin0
      (cond
        [(= c 0) #f]
        [else "2.8858451e-001"])
      (set! c (- c 1)))))

(collect-garbage) (collect-garbage) (collect-garbage)
(time (main fetch-whitespace-delimited-string1))
(collect-garbage) (collect-garbage) (collect-garbage)
(time (main fetch-whitespace-delimited-string2))
(collect-garbage) (collect-garbage) (collect-garbage)
(time (main fetch-whitespace-delimited-string3))


On Wed, Jun 18, 2014 at 1:51 AM, Ryan Davis <zenspider at gmail.com> wrote:
>
> On Jun 17, 2014, at 23:26, Robby Findler <robby at eecs.northwestern.edu> wrote:
>
>> Could it be the float parser? Do you actually get the same floats in
>> both versions?
>
> Yup. I'm getting floats for sure.
>
> The majority of the time in the original version was definitely being spent on string-split (and I confirmed that regexp-split is basically the same, even with bytestrings).
>
> I found a post about regexp-split being dreadfully slow for a big string and a tiny pattern that Matt Flatt fixed a long while ago. This might be a similar edge case? big string, tiny pattern, but tons of matches?


Posted on the users mailing list.