[racket] help me speed up string split?

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Thu Jul 24 11:27:36 EDT 2014

Sorry for the long delay, but I finally took a look at this.

Yes, `string-split` is mistuned. More specifically, `regexp-match`
(which is used by `string-split`) is mistuned for a small match in a
large string. The problem is in the UTF-8 decoding phase, since the
regexp engine works on bytes instead of characters. The matcher starts
out decoding up to 1024 bytes of the input string --- and that's bad
when `string-split` calls `regexp-match` many times on successive
suffixes of a large string.

After some experiments, I've changed the initial decoding chunk to 32
bytes. The decoding chunk grows as more bytes are needed, so a smaller
initial size doesn't have a big effect on large matches, but it has a
big effect on many small matches. Smaller numbers, such as 16, don't
provide any further improvement for small matches.

Your example program now runs in 7.5 seconds instead of 18 seconds on
my machine. That leaves a lot of room for improvement, mostly still in
the set-up and tear-down for regexp matches on strings, but nothing so
obvious and easy as adjusting the decoding chunk.

At Tue, 17 Jun 2014 21:27:54 -0700, Ryan Davis wrote:
> I've got the following code that is stumping me. My original code was in ruby 
> and only took 3.9 seconds. I tried to write the equivalent in racket and was 
> surprised when it came in at 25 seconds. Got some help in IRC and got it down 
> to ~12 seconds by cheating using read. I'm guessing I'm missing something. All 
> timing was done in emacs w/ racket-mode's repl. I figure variations < 2s are 
> due to GC or trivial differences. Suggestions?
> 
> #lang racket
> 
> (require (only-in 2htdp/batch-io read-words))
> 
> (define (fast-read path)
>   (with-input-from-file path (lambda () (port->string))))
> 
> (define (fast-bytes path)
>   (with-input-from-file path (lambda () (port->bytes))))
> 
> (define path (path->string (expand-user-path "~/Desktop/X_train.txt")))
> 
> (time (take (map string->number (read-words path)) 
> 10))                                ; 25154 ms
> (time (take (map string->number (string-split (fast-read path))) 
> 10))                  ; 27195 ms
> (time (take (read (open-input-string (string-append "(" (fast-read path)   
> ")"))) 10)) ; 13683 ms
> (time (take (read (open-input-bytes  (bytes-append #"(" (fast-bytes path) 
> #")"))) 10)) ; 11930 ms
> 
> ;; 66,006,256 bytes in the file
> (string-length (fast-read path))
> ;; 4,124,472 floats in ascii format
> (length (read (open-input-bytes (bytes-append #"(" (fast-bytes path) #")"))))
> ;; File contents look like:
> ;;  2.8858451e-001 -2.0294171e-002 -1.3290514e-001 -9.9527860e-001 
> -9.8311061e-001
> 
> ;; chicken
> ;; (require 'utils)
> ;; (time (length (read (open-input-string (string-append "(" (read-all path) 
> ")")))))
> ;; 11.9 s
> 
> ;; ruby
> ;; % time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size'
> ;; real 0m3.882s
> 
> 
> ____________________
>   Racket Users list:
>   http://lists.racket-lang.org/users

Posted on the users mailing list.