[racket] help me speed up string split?

From: Neil Van Dyke (neil at neilvandyke.org)
Date: Wed Jun 18 18:47:47 EDT 2014

This might still be optimizable in pure Racket; otherwise, mixing Racket 
with R might not be a bad idea for this and other reasons.

Details...

I played with this briefly late last night after emails with Ryan, 
without finding a substantially faster way that still looked elegant as 
Racket code.  It did appear that the performance hit was *not* from GC (not even 
when a huge list was involved, which can be bad for some GCs), but 
either the number parsing or the basic file port I/O.  (BTW, the 
"regexp-match*" approach was more expensive than I would've guessed.)

If speeding this up were important for a consulting client, I would next 
do something that didn't look elegant as Racket code, and put it in a 
reusable module with an elegant interface.  Offhand, I would probably 
next try one of the following two approaches, and if neither of those 
worked, make a C extension that was called once per file: (1) byte-by-byte 
read from buffered I/O with a handwritten Racket DFA, probably doing the 
conversion to a floating-point number as we go; or (2) unbuffered block 
reads to byte strings, sized for optimal file block I/O, and parse 
numbers out of those byte strings quickly.  (Just doing a quick brain 
dump here, since clients need me to do different things now.)
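
As a rough illustration of approach (1), here is a minimal sketch of a 
handwritten DFA that reads one byte at a time from a buffered port and 
builds each number as it goes.  The name "read-numbers" and the input 
format are assumptions made for the example, not something from this 
thread: numbers are plain decimals (optional sign, digits, optional 
fraction, no exponents), separated by whitespace or commas, and there is 
no validation of malformed tokens.  It has not been benchmarked against 
the other approaches.

```racket
#lang racket/base

;; Hypothetical helper (not from the thread): parse decimal numbers from an
;; input port with a small hand-written state machine.
;; Assumed format: optional sign, digits, optional "." and more digits,
;; separated by spaces, tabs, newlines, or commas.  No exponents.
(define (read-numbers in)
  (let loop ([state 'between] ; 'between tokens, in 'int part, or in 'frac part
             [neg? #f]        ; saw a leading minus sign
             [mant 0]         ; all digits seen so far, as an exact integer
             [scale 1]        ; 10^(number of fractional digits seen)
             [acc '()])       ; numbers parsed so far, most recent first
    (define b (read-byte in))
    ;; Convert the current token to a flonum and push it onto acc.
    (define (finish)
      (cons (exact->inexact (/ (if neg? (- mant) mant) scale)) acc))
    (cond
      [(eof-object? b)
       (reverse (if (eq? state 'between) acc (finish)))]
      [(<= 48 b 57) ; ASCII digit "0".."9"
       (let ([d (- b 48)])
         (if (eq? state 'frac)
             (loop 'frac neg? (+ (* mant 10) d) (* scale 10) acc)
             (loop 'int neg? (+ (* mant 10) d) scale acc)))]
      [(and (= b 45) (eq? state 'between)) ; "-" starts a negative number
       (loop 'int #t 0 1 acc)]
      [(and (= b 43) (eq? state 'between)) ; "+" starts a positive number
       (loop 'int #f 0 1 acc)]
      [(and (= b 46) (not (eq? state 'frac))) ; "." switches to the fraction
       (loop 'frac neg? mant scale acc)]
      [(memv b '(32 9 10 13 44)) ; space, tab, LF, CR, comma
       (if (eq? state 'between)
           (loop 'between #f 0 1 acc)
           (loop 'between #f 0 1 (finish)))]
      [else (error 'read-numbers "unexpected byte: ~a" b)])))
```

Something like (call-with-input-file "data.txt" read-numbers), with a 
made-up file name, would run it over a file.  Accumulating digits as an 
exact integer and converting once per token keeps the result correctly 
rounded; accumulating directly in flonums might be faster at some cost 
in accuracy.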

Tools like Mathematica and R presumably have had their 
read-lots-of-numbers-from-a-file made pretty fast.  It's OK to call R or 
Mathematica judiciously from Racket for big data and other 
number-crunching purposes.  (I have a consulting client who mixes these 
two tools well, and currently calls out to an isolated R process on 
other cores from Racket, through a stdio interface. Originally, they did 
this with in-process C extensions, but separate processes are much better 
for a few reasons.)  Although I understand that Dr. Neil T. is well on 
the way to putting more R functionality into pure Racket, so I assume 
more and more people over time will be doing their numeric work in pure 
Racket.

Neil V.


Posted on the users mailing list.