[racket] help me speed up string split?

From: Robby Findler (robby at eecs.northwestern.edu)
Date: Wed Jun 18 19:01:01 EDT 2014

Sorry-- I certainly don't dispute any of your comments on the relative
merit of the code. I was just trying to understand which parts of the
function touch bad performance in Racket.

Robby


On Wed, Jun 18, 2014 at 4:44 PM, Ryan Davis <zenspider at gmail.com> wrote:
>
> On Jun 18, 2014, at 1:26, Robby Findler <robby at eecs.northwestern.edu> wrote:
>
>> I think the ruby version is smarter than your Racket version.
>> Specifically, if you remove the .size (but don't print the output),
>> things slow down by a factor of about 2 for me.
>>
>> $  time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f)'
>>> /dev/null
>> real 0m1.127s
>> user 0m1.071s
>> sys 0m0.054s
>> $  time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size'
>> 655360
>>
>> real 0m0.561s
>> user 0m0.512s
>> sys 0m0.047s
>
> It's possible I grabbed the wrong input. Doing the equivalent take(10) results in roughly the same time. Either way, ruby isn't using lazy lists or anything to cheat in the case of .size. You're probably just seeing a time discrepancy for output generation.
>
> % time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).take(10)'
> [0.0, 0.28858451, -0.020294171, -0.13290514, -0.9952786, -0.98311061, -0.91352645, -0.99511208, -0.98318457, -0.92352702]
>
> real    0m3.896s
> user    0m3.611s
> sys     0m0.281s
>
> % time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size'
> 4124473
>
> real    0m3.886s
> user    0m3.612s
> sys     0m0.263s
>
>> But meanwhile, it looks like the regexp matching is slower than in
>> Ruby. Below are some variations on the code that explore some speedups
>> for the computation you intended to happen, I believe. The middle
>> version (fetch-whitespace-delimited-string2) takes the same amount of
>> time as the version of ruby that is forced to actually compute the
>> list. So maybe there are still more improvements to be done to regexp
>> port matching, at least for certain regexps, I'm not sure.
>
> The first one works, but takes ~55k ms. :(
>
> I could never get the second one to work. It comes back instantly w/ 0. I think my input file doesn't QUITE conform to what you're looking for. I snuck in an extra read-char in before the loop, but still no go.
>
> Either way, it's a really painful way to think about this type of task.
>
> "Read it all in, split on whitespace, convert everything to floats"
>
> vs
>
> "Create a function that takes a function, opens the input, sets up a loop, calls the given function, compares the result and if it is truthy, appends the string converted to a number to the front of a list (which must be later reversed). The passed function needs to set up a loop, read in a character, and if that character is not eof, space, or newline it loops concatinating that character with a list of previously found characters. If eof, or whitespace is discovered, it reverses the collected list and converts it to a string."
>
> That reads more like what you'd write in C rather than a beautiful functional language with a very complete library. It took me several times longer to convert your code to English as it did to write several of these variants in ruby, racket, or mathematica. While it is much more nuanced, I just don't think it comes as naturally or gracefully as the task as laid out in my head. Even my cleanest (and fastest!) versions took several iterations to come by and are a bit of a stretch:
>
>         (define path (path->string (expand-user-path "~/Desktop/X_train.txt")))
>
>         (define (gc)
>           (collect-garbage)(collect-garbage)(collect-garbage))
>
>         (define (go fn)
>           (time (take (with-input-from-file path fn) 10)))
>
>         (gc)
>         (go (lambda ()    ; cpu time: 10428 real time: 10410 gc time:  631 (consistently ~1/2 the gc)
>               (for/list ([s (in-port read)])
>                 s)))
>
>         (gc)
>         (go (lambda ()    ; cpu time:  9883 real time:  9872 gc time: 1182
>               (for/list ([s (in-port (curry read-string 16))])
>                 (string->number (string-trim s)))))
>
>         (gc)
>         (go (lambda ()    ; cpu time: 15006 real time: 14981 gc time: 1154
>               (for/list ([s (in-port (curry read-string 16))])
>                 (read (open-input-string s)))))
>
> I would like to think that the regexp engine can be sped up to the point where "read it, split it, convert it" is a valid option. Until then, I'll stick to for/list + in-port + read, but remembering that that particular combination is the right one is going to be difficult.
>
> P.S. Mathematica only takes ~8 seconds and is no speed demon for this type of task:
>
> In[2]:= AbsoluteTiming[
>  Take[Flatten[Import["~/Desktop/X_train.txt", "Table"]], 10]]
>
> Out[2]= {8.443284, {0.288585, -0.0202942, -0.132905, -0.995279, \
> -0.983111, -0.913526, -0.995112, -0.983185, -0.923527, -0.934724}}
>
> more importantly, I wrote that from memory.
>


Posted on the users mailing list.