[racket] help me speed up string split?
This might still be optimizable in pure Racket; otherwise, mixing Racket
with R might not be a bad idea for this and other reasons.
Details...
I played with this briefly late last night after emails with Ryan,
without finding a substantially faster way that still looked elegant as
Racket code. It did appear that the hit was *not* from GC (not even
when a huge list was involved, which can be bad for some GCs), but
either the number parsing or the basic file port I/O. (BTW, the
"regexp-match*" approach was more expensive than I would've guessed.)
If speeding this up were important for a consulting client, I would next
do something that didn't look elegant as Racket code, and put it in a
reusable module with an elegant interface. Offhand, I would probably
next try one of the following two approaches (and, if neither of those
worked, write a C extension that is called once per file): (1) byte-by-byte
read from buffered I/O with a handwritten Racket DFA, probably doing the
conversion to a floating-point number as we go; or (2) unbuffered block
reads to byte strings, sized for optimal file block I/O, and parse
numbers out of those byte strings quickly. (Just doing a quick brain
dump here, since clients need me to do different things now.)
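A minimal sketch of approach (1), to make the idea concrete. It is
untested for speed, and the name "read-floats" and the accepted number
syntax (optional "-", digits, optional "." and more digits, no
exponents) are assumptions for illustration, not anything from the
original code. Any other byte is treated as a separator, malformed
tokens are not rejected, and long decimals may differ in the last bit
from what "string->number" would give.

```racket
#lang racket/base

;; Byte-by-byte DFA over a (buffered) input port. The number is
;; accumulated as a flonum while the bytes are read, so no intermediate
;; strings or substrings are allocated.
;; States: 'skip (between numbers), 'int (integer part), 'frac (after ".").
(define (read-floats in)
  (let loop ([acc '()]      ; numbers parsed so far, in reverse order
             [state 'skip]
             [sign 1.0]
             [mant 0.0]     ; all digits seen so far, as one flonum
             [scale 1.0])   ; 10^(number of fraction digits seen)
    (define b (read-byte in))
    (cond
      [(eof-object? b)
       (reverse (if (eq? state 'skip)
                    acc
                    (cons (* sign (/ mant scale)) acc)))]
      [(and (>= b 48) (<= b 57))               ; ASCII "0".."9"
       (define m (+ (* mant 10.0) (- b 48)))
       (if (eq? state 'frac)
           (loop acc 'frac sign m (* scale 10.0))
           (loop acc 'int sign m 1.0))]
      [(and (= b 46) (not (eq? state 'frac)))  ; "." starts the fraction
       (loop acc 'frac sign mant 1.0)]
      [(and (= b 45) (eq? state 'skip))        ; leading "-"
       (loop acc 'int -1.0 0.0 1.0)]
      [(eq? state 'skip)                       ; separator between numbers
       (loop acc 'skip 1.0 0.0 1.0)]
      [else                                    ; separator ends a number
       (loop (cons (* sign (/ mant scale)) acc) 'skip 1.0 0.0 1.0)])))

;; Example with a string port; should give the list 1.5, -2.0, 3.25.
(read-floats (open-input-string "1.5 -2 3.25\n"))
```

For a real file, something like (call-with-input-file "data.txt"
read-floats) would be the obvious way to call it, where "data.txt" is
a placeholder name.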
Tools like Mathematica and R have presumably had their
read-lots-of-numbers-from-a-file code made pretty fast. It's OK to call R or
Mathematica judiciously from Racket for big data and other
number-crunching purposes. (I have a consulting client who mixes these
two tools well, and currently calls out to an isolated R process on
other cores from Racket, through a stdio interface. Originally, they did
this with in-process C extensions, but separate processes are much better
for a few reasons.) Although I understand that Dr. Neil T. is well on
the way to putting more R functionality into pure Racket, so I assume
more and more people over time will be doing their numeric work in pure
Racket.
Neil V.